Abstract

We study computing geometric problems on uncertain points. An uncertain point is a point that does not have a fixed 
location, but rather is described by a probability distribution. When these probability distributions are restricted to a 
finite number of locations, the points are called indecisive points. In particular, we focus on geometric shape-fitting 
problems and on building compact distributions to describe how the solutions to these problems vary with respect to 
the uncertainty in the points. Our main results are: (1) a simple and efficient randomized approximation algorithm for 
calculating the distribution of any statistic on uncertain data sets; (2) a polynomial, deterministic and exact algorithm 
for computing the distribution of answers for any LP-type problem on an indecisive point set; and (3) the development 
of shape inclusion probability (SIP) functions, which capture the ambient distribution of shapes fit to uncertain or
indecisive point sets and are admissible to the two algorithmic constructions.



1 Introduction 

In gathering data there is a trade-off between quantity and accuracy. The drop in the price of hard drives 
and other storage costs has shifted this balance towards gathering enormous quantities of data, yet with 
noticeable and sometimes intentionally tolerated increased error rates. However, often as a benefit from the 
large data sets, models are developed to describe the pattern of the data error. 

Let us take as an example Light Detection and Ranging (LIDAR) data gathered for Geographic Information 
Systems (GIS) [ID], specifically height values at millions of locations on a terrain. Each data point $(x, y, z)$
has an $x$-value (longitude), a $y$-value (latitude), and a $z$-value (height). This data set is gathered by a small
plane flying over a terrain with a laser aimed at the ground measuring the distance from the plane to the 
ground. Error can occur due to inaccurate estimation of the plane's altitude and position or artifacts on 
the ground distorting the laser's distance reading. But these errors are well-studied and can be modeled by 
replacing each data point with a probability distribution of its actual position. Greatly simplifying, we could 
represent each data point as a 3-variate normal distribution centered at its recorded value; in practice, more 
detailed uncertainty models are built. 

Similarly, large data sets are gathered and maintained for many other applications. In robotic mapping [55]
[22] error models are provided for data points gathered by laser range finders and other sources. In data
mining [T] [SJ original data (such as published medical data) are often perturbed by a known model to preserve
anonymity. In spatial databases [28, 53, 15] large data sets may be summarized as probability distributions to
store them more compactly. Data sets gathered by crawling the web have many false positives, and allow for
models of these rates. Sensor networks [2U] stream in large data sets collected by cheap and thus inaccurate
sensors. In protein structure determination [51] every atom's position is imprecise due to inaccuracies in
reconstruction techniques and the inherent flexibility in the protein. In summary, there are many large data
sets with modeled errors and this uncertainty should be dealt with explicitly.

Fig. 1: (a) An example input consisting of $n = 3$ sets of $k = 6$ points each. (b) One of the $6^3$ possible samples of $n = 3$
points.

1.1 The Input: Geometric Error Models 

The input for a typical computational geometry problem is a set $P$ of $n$ points in $\mathbb{R}^2$, or more generally $\mathbb{R}^d$.
In this paper we consider extensions of this model where each point is also given a model of its uncertainty. 
This model describes for each point a distribution or bounds on the point's location, if it exists at all. 

• Most generally, we describe these data as uncertain points $\mathcal{P} = \{P_1, P_2, \ldots, P_n\}$. Here each point's
location is described by a probability distribution $\mu_i$ (for instance by a Gaussian distribution). This
general model can be seen to encompass the forthcoming models, but is often not worked with directly
because of the computational difficulties arising from its generality. For instance, in tracking uncertain
objects a particle filter uses a discrete set of locations to model uncertainty [JS] while a Kalman filter
restricts the uncertainty model to a Gaussian distribution [33].

• A more restrictive model we also study in this paper is that of indecisive points, where each point can take
one of a finite number of locations. To simplify the model (purely for making results easier to state) we
let each point have exactly $k$ possible locations, forming the domain of a probability distribution. That
is, each uncertain point $P_i$ is at one of $\{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\}$. Unless further specified, each location is
equally likely with probability $1/k$, but we can also assign each location a weight $w_{i,j}$ as the probability
that $P_i$ is at $p_{i,j}$, where $\sum_{j=1}^{k} w_{i,j} = 1$ for all $i$.

Indecisive points appear naturally in many applications. They play an important role in databases [19]
[THJ [TCI [Ml 121 02], machine learning [TO], and sensor networks [55] where a limited number of probes
from a certain data set are gathered, each potentially representing the true location of a data point. 
Alternatively, data points may be obtained using imprecise measurements or are the result of inexact 
earlier computations. However, the results with detailed algorithmic analysis generally focus on one- 
dimensional data; furthermore, they often only return the expected value or the most likely answer 
instead of calculating a full distribution. 

• An imprecise point is one where its location is not known precisely, but it is restricted to a range. 
In one-dimension these ranges are modeled as uncertainty intervals, but in 2 or higher dimensions 
they become geometric regions. An early model to quantify imprecision in geometric data, motivated 
by finite precision of coordinates, is $\varepsilon$-geometry, introduced by Guibas et al. [22], where each point
was only known to be somewhere within an $\varepsilon$-radius ball of its guessed location. The simplicity of
this model has provided many uses in geometry. Guibas et al. [27] define strongly convex polygons:
polygons that are guaranteed to stay convex, even when the vertices are perturbed by $\varepsilon$. Bandyopadhyay
and Snoeyink [5] compute the set of all potential simplices in $\mathbb{R}^2$ and $\mathbb{R}^3$ that could belong to the
Delaunay triangulation. Held and Mitchell [3T] and Löffler and Snoeyink J5JJ study the problem of
preprocessing a set of imprecise points under this model, so that when the true points are specified 
later some computation can be done faster. 

A more involved model for imprecision can be obtained by not specifying a single $\varepsilon$ for all the points, but
allowing a different radius for each point, or even other shapes of imprecision regions. This allows for
modeling imprecision that comes from different sources, independent imprecision in different dimensions
of the input, etc. This extra freedom in modeling comes at the price of more involved algorithmic
solutions, but still many results are available. Nagai and Tokura [47] compute the union and intersection
of all possible convex hulls to obtain bounds on any possible solution, as do Ostrovsky-Berman and
Joskowicz [JSJ in a setting allowing some dependence between points. Van Kreveld and Löffler [37] study
the problem of computing the smallest and largest possible values of several geometric extent measures,
such as the diameter or the radius of the smallest enclosing ball, where the points are restricted to lie
in given regions in the plane. Kruger [38] extends some of these results to higher dimensions.

Although imprecise points do not traditionally have a probability distribution associated with
them, we argue that they can still be considered a special case of our uncertain points, since we can
impose e.g. a uniform distribution on the regions, and then ask questions about the smallest or largest
non-zero probability values of some function, which would correspond to bounds in the classical model.

• A stochastic point $p$ has a fixed location, but exists only with a probability $\rho$. These points arise
naturally in many database scenarios [7] [17] where gathered data has many false positives. Recently in
geometry Kamousi, Chan, and Suri [35] [33] considered geometric problems on stochastic points and
geometric graphs with stochastic edges. These stochastic data sets can be interpreted as uncertain
point sets as well by allowing the probability distribution governing uncertain points to have a certain
probability of not existing; that is, the integral of the distribution is $\rho$ instead of always 1.
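As an illustration of the indecisive model above, the following is a minimal sketch of one possible in-memory representation. The names `make_indecisive` and `sample_support` are hypothetical (they do not come from the paper), and the weights are assumed to be normalized probabilities as in the definition of $w_{i,j}$.

```python
import random

def make_indecisive(locations, weights=None):
    """Hypothetical representation: a list of (location, weight) pairs.

    With no weights given, every location gets probability 1/k.
    """
    k = len(locations)
    if weights is None:
        weights = [1.0 / k] * k
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights of one indecisive point must sum to 1")
    return list(zip(locations, weights))

def sample_support(points, rng=random):
    """Draw one location per indecisive point according to its weights."""
    support = []
    for p in points:
        locs = [loc for loc, _ in p]
        wts = [w for _, w in p]
        support.append(rng.choices(locs, weights=wts, k=1)[0])
    return support
```

Each call to `sample_support` returns one of the $k^n$ possible samples illustrated in Fig. 1(b).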

1.2 The Output: Distributional Representations 

This paper studies how to compute distributions of statistics over uncertain data. These distributions can 
take several forms. In the simplest case, a distribution of a single value has a one-dimensional domain. The 
technical definition yields a simpler exposition when the distribution is represented as a cumulative density 
function, which we refer to as a quantization. This notion can be extended to a multi-dimensional cumulative 
density function (a k-variate quantization) as we measure multiple variables simultaneously. Finally, we 
also describe distributions over shapes defined on uncertain points (e.g. minimum enclosing ball). As the 
domains of these shape distributions are a bit abstract and difficult to work with, we convey this information 
as a shape inclusion probability or SIP; for any point in the domain of the input point sets we describe the 
probability the point is contained in the shape. 

This model of uncertain data has been studied in the database community but for different types of 
problems on usually one-dimensional data, such as indexing [2] [54] [32], ranking [17], nearest neighbors [M],
and creating histograms [16].

1.3 Contributions 

For each type of distributional representation of output we study, the goal is a function from some domain 
to a range of $[0, 1]$. For the general case of uncertain points, we provide simple and efficient randomized
approximation algorithms that result in a function that everywhere has error at most $\varepsilon$. Each variation of
the algorithm runs in $O((1/\varepsilon^2)(\nu + \log(1/\delta))T)$ time and produces an output of size $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$,
where $\nu$ describes the complexity of the output shape (i.e. VC-dimension), $\delta$ is the probability of failure,
and $T$ is the time it takes to compute the geometric question on certain points. These results are quite
practical as experimental results demonstrate that the constant for the big-Oh notation is approximately 0.5.
Furthermore, for one-dimensional output distributions (quantizations) the size can be reduced to $1/\varepsilon$, and for
$k$-dimensional distributions to $O((k/\varepsilon) \log^{2k}(1/\varepsilon))$. We also extend these approaches to allow for geometric
approximations based on $\alpha$-kernels [4] [3].

For the case of indecisive points, we provide deterministic and exact, polynomial-time algorithms for all 
LP-type problems with constant combinatorial dimension (e.g. minimum enclosing ball). We also provide 
evidence that for problems outside this domain, deterministic exact algorithms are not available, in particular 
showing that diameter is #P-hard despite having an output distribution of polynomial size. Finally, we
consider deterministic algorithms for uncertain point sets with continuous distributions describing the location
of each point. We describe a non-trivial range space on these input distributions from which an $\varepsilon$-sample
creates a set of indecisive points, from which this algorithm can be performed to deterministically create an
approximation to the output distribution.



2 Preliminaries 

This section provides formal definitions for existing approximation schemes related to our work as well as
for our output distributions.

2.1 Approximation Schemes: $\varepsilon$-Samples and $\alpha$-Kernels

This work allows for three types of approximations. The most natural in this setting is controlled by a
parameter $\varepsilon$ which denotes the error tolerance for probability. That is, an $\varepsilon$-approximation for any function
with range in $[0, 1]$ measuring probability can return a value off by at most an additive $\varepsilon$. The second type of
error is a parameter $\delta$ which denotes the chance of failure of a randomized algorithm. That is, a $\delta$-approximate
randomized algorithm will be correct with probability at least $1 - \delta$. Finally, in specific contexts we allow a
geometric error parameter $\alpha$. In our context, an $\alpha$-approximate geometric algorithm can allow a relative
$\alpha$-error in the width of any object. This is explained more formally below. It should be noted that these three
types of error cannot be combined into a single term, and each needs to be considered separately. However,
the $\varepsilon$ and $\delta$ parameters have a well-defined trade-off.

$\varepsilon$-Samples. For a set $P$ let $\mathcal{A}$ be a set of subsets of $P$. In our context usually $P$ will be a point set and
the subsets in $\mathcal{A}$ could be induced by containment in a shape from some family of geometric shapes. For
example, $\mathcal{I}_+$ describes one-sided intervals of the form $(-\infty, t)$. The pair $(P, \mathcal{A})$ is called a range space. We
say that $Q \subset P$ is an $\varepsilon$-sample of $(P, \mathcal{A})$ if

$$\max_{R \in \mathcal{A}} \left| \frac{\phi(R \cap Q)}{\phi(Q)} - \frac{\phi(R \cap P)}{\phi(P)} \right| \le \varepsilon,$$

where $|\cdot|$ takes the absolute value and $\phi(\cdot)$ returns the measure of a point set. In the discrete case $\phi(Q)$
returns the cardinality of $Q$. We say $\mathcal{A}$ shatters a set $S$ if every subset of $S$ is equal to $R \cap S$ for some $R \in \mathcal{A}$.
The cardinality of the largest discrete set $S \subset P$ that $\mathcal{A}$ can shatter is the VC-dimension of $(P, \mathcal{A})$.
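To make the $\varepsilon$-sample condition concrete, here is a small sketch for the discrete, one-dimensional case with the one-sided intervals $\mathcal{I}_+$, where $\phi$ is cardinality. The function names are illustrative only and not part of the paper.

```python
import bisect

def one_sided_error(P, Q):
    """Largest gap between the fractions of Q and of P inside a range (-inf, t)."""
    P_sorted, Q_sorted = sorted(P), sorted(Q)
    worst = 0.0
    # Both fractions change only at data values, so it suffices to test
    # thresholds just below (bisect_left) and just above (bisect_right) each value.
    for t in set(P_sorted) | set(Q_sorted):
        for count in (bisect.bisect_left, bisect.bisect_right):
            frac_p = count(P_sorted, t) / len(P_sorted)
            frac_q = count(Q_sorted, t) / len(Q_sorted)
            worst = max(worst, abs(frac_q - frac_p))
    return worst

def is_eps_sample(P, Q, eps):
    """True if Q satisfies the eps-sample inequality for one-sided intervals."""
    return one_sided_error(P, Q) <= eps
```

For example, taking every tenth value of `range(100)` gives a subset whose worst-case error is 0.09, so it is a 0.1-sample but not a 0.05-sample for this range space.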

When $(P, \mathcal{A})$ has constant VC-dimension $\nu$, we can create an $\varepsilon$-sample $Q$ of $(P, \mathcal{A})$, with probability $1 - \delta$,
by uniformly sampling $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ points from $P$ [57]. There exist deterministic techniques
to create $\varepsilon$-samples [331 U3] of size $O(\nu(1/\varepsilon^2) \log(1/\varepsilon))$ in time $O(\nu^{3\nu} n ((1/\varepsilon^2) \log(\nu/\varepsilon))^{\nu})$. A recent result of
Bansal [9] (also see this simplification [42]) can slightly improve this bound to $O(1/\varepsilon^{2 - 1/2\nu})$, following an older
existence proof [45], in time polynomial in $n$ and $1/\varepsilon$. When $P$ is a point set in $\mathbb{R}^d$ and the family of ranges
$\mathcal{R}_d$ is determined by inclusion in axis-aligned boxes, then an $\varepsilon$-sample for $(P, \mathcal{R}_d)$ of size $O((d/\varepsilon) \log^{2d}(1/\varepsilon))$
can be constructed in $O((n/\varepsilon^3) \log^{6d}(1/\varepsilon))$ time [15].

For a range space $(P, \mathcal{A})$ the dual range space is defined as $(\mathcal{A}, P^*)$ where $P^*$ is all subsets $\mathcal{A}_p \subset \mathcal{A}$ defined
for an element $p \in P$ such that $\mathcal{A}_p = \{A \in \mathcal{A} \mid p \in A\}$. If $(P, \mathcal{A})$ has VC-dimension $\nu$, then $(\mathcal{A}, P^*)$ has
VC-dimension $\le 2^{\nu+1}$. Thus, if the VC-dimension of $(\mathcal{A}, P^*)$ is constant, then the VC-dimension of $(P, \mathcal{A})$ is
also constant [44].

When we have a distribution $\mu : \mathbb{R}^d \to \mathbb{R}^+$, such that $\int_{x \in \mathbb{R}^d} \mu(x)\, dx = 1$, we can think of this as the set $P$
of all points in $\mathbb{R}^d$, where the weight $w$ of a point $p \in \mathbb{R}^d$ is $\mu(p)$. To simplify notation, we write $(\mu, \mathcal{A})$ as a
range space where the ground set is this set $P = \mathbb{R}^d$ weighted by the distribution $\mu$.

$\alpha$-Kernels. Given a point set $P \subset \mathbb{R}^d$ of size $n$ and a direction $u \in \mathbb{S}^{d-1}$, let $P[u] = \arg\max_{p \in P} \langle p, u \rangle$, where
$\langle \cdot, \cdot \rangle$ is the inner product operator. Let $\omega(P, u) = \langle P[u] - P[-u], u \rangle$ describe the width of $P$ in direction $u$.
We say that $K \subset P$ is an $\alpha$-kernel of $P$ if for all $u \in \mathbb{S}^{d-1}$

$$\omega(P, u) - \omega(K, u) \le \alpha \cdot \omega(P, u).$$
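The width function and the $\alpha$-kernel inequality can be sketched as follows for points in the plane. This is only a spot check on finitely many directions (the definition quantifies over all $u \in \mathbb{S}^{d-1}$), and it does not construct a kernel; the helper names are assumptions made for illustration.

```python
import math

def width(P, u):
    """Directional width <P[u] - P[-u], u>, i.e. max <p,u> minus min <p,u>."""
    dots = [sum(pi * ui for pi, ui in zip(p, u)) for p in P]
    return max(dots) - min(dots)

def passes_alpha_kernel_check(P, K, alpha, num_directions=360):
    """Test width(P,u) - width(K,u) <= alpha * width(P,u) on evenly spaced 2D directions."""
    for i in range(num_directions):
        theta = 2.0 * math.pi * i / num_directions
        u = (math.cos(theta), math.sin(theta))
        w_p = width(P, u)
        # small tolerance guards against floating-point noise
        if w_p - width(K, u) > alpha * w_p + 1e-12:
            return False
    return True
```

Keeping only the convex hull vertices of $P$ always passes with $\alpha = 0$, since interior points never determine the width.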



Fig. 2: (a) The true form of a monotonically increasing function from $\mathbb{R} \to \mathbb{R}$. (b) The $\varepsilon$-quantization $R$ as a point set in
$\mathbb{R}$. (c) The inferred curve $h_R$ in $\mathbb{R}^2$.

$\alpha$-kernels of size $O(1/\alpha^{(d-1)/2})$ [4] can be calculated in time $O(n + 1/\alpha^{d-3/2})$ [L2"l [58]. Computing many
extent related problems such as diameter and smallest enclosing ball on $K$ approximates the problem on

p @l El El- 

2.2 Problem Statement 

Let $\mu_i : \mathbb{R}^d \to \mathbb{R}^+$ describe the probability distribution of an uncertain point $P_i$ where the integral
$\int_{q \in \mathbb{R}^d} \mu_i(q)\, dq = 1$. We say that a set $Q$ of $n$ points is a support from $\mathcal{P}$ if it contains exactly one
point from each set $P_i$, that is, if $Q = \{q_1, q_2, \ldots, q_n\}$ with $q_i \in P_i$. In this case we also write $Q \Subset \mathcal{P}$. Let
$\mu_P : \mathbb{R}^d \times \mathbb{R}^d \times \ldots \times \mathbb{R}^d \to \mathbb{R}^+$ describe the distribution of supports $Q = \{q_1, q_2, \ldots, q_n\}$ under the joint
probability over each $q_i \in P_i$. For brevity we write the space $\mathbb{R}^d \times \ldots \times \mathbb{R}^d$ as $\mathbb{R}^{dn}$. For this paper we will
assume $\mu_P(q_1, q_2, \ldots, q_n) = \prod_{i=1}^{n} \mu_i(q_i)$, so the distribution for each point is independent, although this
restriction can be easily removed for all randomized algorithms.

Quantizations and their approximations. Let $f : \mathbb{R}^{dn} \to \mathbb{R}^k$ be a function on a fixed point set. Examples
include the radius of the minimum enclosing ball where $k = 1$ and the width of the minimum enclosing
axis-aligned rectangle along the $x$-axis and $y$-axis where $k = 2$. Define the "dominates" binary operator $\preceq$ so
that $(p_1, \ldots, p_k) \preceq (v_1, \ldots, v_k)$ is true if for every coordinate $p_i \le v_i$. Let $X_f(v) = \{Q \in \mathbb{R}^{dn} \mid f(Q) \preceq v\}$.
For a query value $v$ define $F_{\mu_P}(v) = \int_{Q \in X_f(v)} \mu_P(Q)\, dQ$. Then $F_{\mu_P}$ is the cumulative density function of the
distribution of possible values that $f$ can take.¹ We call $F_{\mu_P}$ a quantization of $f$ over $\mu_P$.

Ideally, we would return the function $F_{\mu_P}$ so we could quickly answer any query exactly; however, for
the most general case we consider, it is not clear how to calculate $F_{\mu_P}(v)$ exactly for even a single query
value $v$. Rather, we introduce a data structure, which we call an $\varepsilon$-quantization, to answer any such query
approximately and efficiently, illustrated in Figure 2 for $k = 1$. An $\varepsilon$-quantization is a point set $R \subset \mathbb{R}^k$
which induces a function $h_R$ where $h_R(v)$ describes the fraction of points in $R$ that $v$ dominates. Let
$R_v = \{r \in R \mid r \preceq v\}$. Then $h_R(v) = |R_v|/|R|$. For an isotonic (monotonically increasing in each coordinate)
function $F_{\mu_P}$ and any value $v$, an $\varepsilon$-quantization, $R$, guarantees that $|h_R(v) - F_{\mu_P}(v)| \le \varepsilon$. More generally
(and, for brevity, usually only when $k > 1$), we say $R$ is a $k$-variate $\varepsilon$-quantization. An example of a 2-variate
$\varepsilon$-quantization is shown in Figure 3. The space required to store the data structure for $R$ is dependent only
on $\varepsilon$ and $k$, not on $|P|$ or $\mu_P$.
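Querying an $\varepsilon$-quantization amounts to counting dominated points. A minimal sketch, assuming $R$ is stored as a list of $k$-tuples (the names `dominated_by` and `h` are illustrative):

```python
def dominated_by(r, v):
    """True if r is dominated by v, i.e. r_i <= v_i in every coordinate."""
    return all(ri <= vi for ri, vi in zip(r, v))

def h(R, v):
    """Fraction of points of R dominated by the query value v, i.e. |R_v| / |R|."""
    return sum(1 for r in R if dominated_by(r, v)) / len(R)
```

This linear scan takes $O(k|R|)$ time per query; for $k = 1$ a sorted array with binary search would answer the same query in logarithmic time.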

$(\varepsilon, \delta, \alpha)$-Kernels. Rather than compute a new data structure for each measure we are interested in, we can also
compute a single data structure (a coreset) that allows us to answer many types of questions. For an isotonic
function $F_{\mu_P} : \mathbb{R}^+ \to [0, 1]$, an $(\varepsilon, \alpha)$-quantization data structure $M$ describes a function $h_M : \mathbb{R}^+ \to [0, 1]$
so for any $x \in \mathbb{R}^+$, there is an $x' \in \mathbb{R}^+$ such that (1) $|x - x'| \le \alpha x$ and (2) $|h_M(x) - F_{\mu_P}(x')| \le \varepsilon$. An
$(\varepsilon, \delta, \alpha)$-kernel is a data structure that can produce an $(\varepsilon, \alpha)$-quantization, with probability at least $1 - \delta$, for
$F_{\mu_P}$ where $f$ measures the width in any direction and whose size depends only on $\varepsilon$, $\alpha$, and $\delta$. The notion of
$(\varepsilon, \alpha)$-quantizations generalizes to a $k$-variate version, as do $(\varepsilon, \delta, \alpha)$-kernels.

¹ For a function $f$ and a distribution of point sets $\mu_P$, we will always represent the cumulative density function of $f$ over $\mu_P$
by $F_{\mu_P}$.

Fig. 3: (a) The true form of an $xy$-monotone 2-variate function. (b) The $\varepsilon$-quantization $R$ as a point set in $\mathbb{R}^2$. (c) The
inferred surface $h_R$ in $\mathbb{R}^3$. (d) Overlay of the two images.



Shape inclusion probabilities. A summarizing shape of a point set $P \subset \mathbb{R}^d$ is a Lebesgue-measurable
subset of $\mathbb{R}^d$ that is determined by $P$. That is, given a class of shapes $\mathcal{S}$, the summarizing shape $S(P) \in \mathcal{S}$ is the
shape that optimizes some aspect with respect to $P$. Examples include the smallest enclosing ball and the
minimum-volume axis-aligned bounding box. For a family $\mathcal{S}$ we can study the shape inclusion probability
function $s_{\mu_P} : \mathbb{R}^d \to [0, 1]$ (or sip function), where $s_{\mu_P}(q)$ describes the probability that a query point $q \in \mathbb{R}^d$
is included in the summarizing shape.² For the more general types of uncertain points, there does not seem to
be a closed form for many of these functions. In these cases we can calculate an $\varepsilon$-sip function $\hat{s} : \mathbb{R}^d \to [0, 1]$
such that $\forall q \in \mathbb{R}^d$, $|s_{\mu_P}(q) - \hat{s}(q)| \le \varepsilon$. The space required to store an $\varepsilon$-sip function depends only on $\varepsilon$ and the
complexity of the summarizing shape.
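One simple way to picture an approximate sip function is as an average over summarizing shapes computed on sampled point sets. The sketch below assumes the summarizing shape is the axis-aligned bounding box in the plane and that the list `boxes` was produced from random supports; it illustrates the idea only and is not the paper's data structure.

```python
def bounding_box(Q):
    """Axis-aligned bounding box of a planar point set as (xmin, ymin, xmax, ymax)."""
    xs = [p[0] for p in Q]
    ys = [p[1] for p in Q]
    return (min(xs), min(ys), max(xs), max(ys))

def sip_estimate(boxes, q):
    """Fraction of the stored boxes that contain the query point q."""
    hits = sum(1 for (x0, y0, x1, y1) in boxes
               if x0 <= q[0] <= x1 and y0 <= q[1] <= y1)
    return hits / len(boxes)
```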



3 Randomized Algorithm for $\varepsilon$-Quantizations



We develop several algorithms with the following basic structure (as outlined in Algorithm 3.1): (1) sample
one point from each distribution to get a random point set; (2) construct the summarizing shape of the
random point set; (3) repeat the first two steps $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ times and calculate a summary data
structure. This algorithm only assumes that we can draw a random point from $\mu_p$ for each $p \in P$ in constant
time; if the time depends on some other parameters, the time complexity of the algorithms can be easily
adjusted.



Algorithm 3.1 Approximate $\mu_P$ w.r.t. a family of shapes $\mathcal{S}$ or function $f_{\mathcal{S}}$

1: for $i = 1$ to $m = O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ do
2:     for all $p_j \in P$ do
3:         Sample $q_j$ from $\mu_{p_j}$.
4:     Set $V_i = f_{\mathcal{S}}(\{q_1, q_2, \ldots, q_n\})$.
5: Reduce or Simplify the set $V = \{V_i\}_{i=1}^{m}$.
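Lines 1 to 4 of Algorithm 3.1 can be sketched in Python as follows. The interface is an assumption made for illustration: each uncertain point is given as a callable `draw(rng)` returning one sampled location, `f` is any statistic on a list of points, and the constant `C` in front of the sample bound is a free parameter. Step 5 (reduction) is left out here.

```python
import math
import random

def diameter(Q):
    """Example statistic f: largest pairwise distance within the point set Q."""
    return max(math.dist(a, b) for a in Q for b in Q)

def gaussian_point(center, sigma):
    """Hypothetical sampler for an uncertain point with an isotropic Gaussian model."""
    def draw(rng):
        return tuple(rng.gauss(c, sigma) for c in center)
    return draw

def sample_statistic(samplers, f, eps, delta, nu=1.0, C=1.0, rng=None):
    """Return m = ceil(C (1/eps^2)(nu + log(1/delta))) sampled values of f."""
    rng = rng or random.Random()
    m = math.ceil(C * (1.0 / eps ** 2) * (nu + math.log(1.0 / delta)))
    V = []
    for _ in range(m):                        # line 1
        Q = [draw(rng) for draw in samplers]  # lines 2-3: one point per distribution
        V.append(f(Q))                        # line 4
    return V
```

A call such as `sample_statistic([gaussian_point((0, 0), 1.0), gaussian_point((5, 0), 1.0)], diameter, 0.1, 0.05)` returns the unsorted sample values $V$, which can then be sorted or reduced.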



Algorithm for $\varepsilon$-quantizations. For a function $f$ on a point set $P$ of size $n$, it takes $T_f(n)$ time to evaluate
$f(P)$. We construct an approximation to $F_{\mu_P}$ as follows. First draw a sample point $q_j$ from each $\mu_{p_j}$ for
$p_j \in P$, then evaluate $V_i = f(\{q_1, \ldots, q_n\})$. The fraction of trials of this process that produces a value
dominated by $v$ is the estimate of $F_{\mu_P}(v)$. In the univariate case we can reduce the size of $V$ by returning
$2/\varepsilon$ evenly spaced points according to the sorted order.
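The univariate reduction step can be sketched as below. The exact choice of which elements to keep (the last element of each of $\lceil 2/\varepsilon \rceil$ equal-sized blocks of the sorted order) is an assumption; the text only asks for evenly spaced points.

```python
import math

def reduce_quantization(values, eps):
    """Keep about 2/eps of the values, evenly spaced in sorted order."""
    V = sorted(values)
    m = len(V)
    t = math.ceil(2.0 / eps)
    if m <= t:
        return V
    # Keep the last element of each of t (nearly) equal-sized blocks.
    return [V[math.ceil((i + 1) * m / t) - 1] for i in range(t)]
```

With this choice the step function induced by the reduced set differs from the one induced by the full sorted set by less than $1/t \le \varepsilon/2$ at every query value.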

Theorem 3.1. For a distribution $\mu_P$ of $n$ points, with success probability at least $1 - \delta$, there exists an
$\varepsilon$-quantization of size $O(1/\varepsilon)$ for $F_{\mu_P}$, and it can be constructed in $O(T_f(n)(1/\varepsilon^2) \log(1/\delta))$ time.

² For technical reasons, if there are (degenerately) multiple optimal summarizing shapes, we say each is equally likely to be
the summarizing shape of the point set.

Proof. Because $F_{\mu_P} : \mathbb{R} \to [0, 1]$ is an isotonic function, there exists another function $g : \mathbb{R} \to \mathbb{R}^+$ such that
$F_{\mu_P}(t) = \int_{x=-\infty}^{t} g(x)\, dx$ where $\int_{x \in \mathbb{R}} g(x)\, dx = 1$. Thus $g$ is a probability distribution of the values of $f$
given inputs drawn from $\mu_P$. This implies that an $\varepsilon$-sample of $(g, \mathcal{I}_+)$ is an $\varepsilon$-quantization of $F_{\mu_P}$, since both
estimate within $\varepsilon$ the fraction of points in any range of the form $(-\infty, x)$. This last fact can also be seen
through a result by Dvoretzky, Kiefer, and Wolfowitz [21].

By drawing a random sample $q_i$ from each $\mu_{p_i}$ for $p_i \in P$, we are drawing a random point set $Q$ from $\mu_P$.
Thus $f(Q)$ is a random sample from $g$. Hence, using the standard randomized construction for $\varepsilon$-samples,
$O((1/\varepsilon^2) \log(1/\delta))$ such samples will generate an $(\varepsilon/2)$-sample for $g$, and hence an $(\varepsilon/2)$-quantization for $F_{\mu_P}$,
with probability at least $1 - \delta$.

Since in an $(\varepsilon/2)$-quantization $R$ every value $h_R(v)$ is different from $F_{\mu_P}(v)$ by at most $\varepsilon/2$, we
can take an $(\varepsilon/2)$-quantization of the function described by $h_R(\cdot)$ and still have an $\varepsilon$-quantization of $F_{\mu_P}$.
Thus, we can reduce this to an $\varepsilon$-quantization of size $O(1/\varepsilon)$ by taking a subset of $2/\varepsilon$ points spaced evenly
according to their sorted order. □

Multivariate $\varepsilon$-quantizations. We can construct $k$-variate $\varepsilon$-quantizations similarly using the same basic
procedure as in Algorithm 3.1. The output $V_i$ of $f$ is now $k$-variate and thus results in a $k$-dimensional point.
As a result, the reduction of the final size of the point set requires more advanced procedures.

Theorem 3.2. Given a distribution $\mu_P$ of $n$ points, with success probability at least $1 - \delta$, we can construct a
$k$-variate $\varepsilon$-quantization for $F_{\mu_P}$ of size $O((k/\varepsilon^2)(k + \log(1/\delta)))$ and in time $O(T_f(n)(1/\varepsilon^2)(k + \log(1/\delta)))$.

Proof. Let $\mathcal{R}_+$ describe the family of ranges where a range $A_p = \{q \in \mathbb{R}^k \mid q \preceq p\}$. In the $k$-variate case there
exists a function $g : \mathbb{R}^k \to \mathbb{R}^+$ such that $F_{\mu_P}(v) = \int_{x \preceq v} g(x)\, dx$ where $\int_{x \in \mathbb{R}^k} g(x)\, dx = 1$. Thus $g$ describes
the probability distribution of the values of $f$, given inputs drawn randomly from $\mu_P$. Hence a random point
set $Q$ from $\mu_P$, evaluated as $f(Q)$, is still a random sample from the $k$-variate distribution described by $g$.
Thus, with probability at least $1 - \delta$, a set of $O((1/\varepsilon^2)(k + \log(1/\delta)))$ such samples is an $\varepsilon$-sample of $(g, \mathcal{R}_+)$,
which has VC-dimension $k$, and the samples are also a $k$-variate $\varepsilon$-quantization of $F_{\mu_P}$. Again, this specific
VC-dimension sampling result can also be achieved through a result of Kiefer and Wolfowitz [33]. □

We can then reduce the size of the $\varepsilon$-quantization $R$ to $O((k^2/\varepsilon) \log^{2k}(1/\varepsilon))$ in $O(|R|(k/\varepsilon^3) \log^{6k}(1/\varepsilon))$
time [IH! or to $O((k^2/\varepsilon^2) \log(1/\varepsilon))$ in $O(|R|(k^{3k}/\varepsilon^{2k}) \cdot \log^{k}(k/\varepsilon))$ time [13], since the VC-dimension is $k$ and
each data point requires $O(k)$ storage.

Also on $k$-variate statistics, we can query the resulting $k$-dimensional distribution using other shapes with
bounded VC-dimension $\nu$, and if the sample size is $m = O((1/\varepsilon^2)(\nu + \log(1/\delta)))$, then all queries have at
most $\varepsilon$-error with probability at least $1 - \delta$. In contrast to the two above results, this statement seems to
require the VC-dimension view, as opposed to appealing to the Kiefer-Wolfowitz line of work [2"T1 135].

3.1 $(\varepsilon, \delta, \alpha)$-Kernels

The above construction works for a fixed family of summarizing shapes. In this section, we show how to 
build a single data structure, an $(\varepsilon, \delta, \alpha)$-kernel, for a distribution $\mu_P$ in $\mathbb{R}^{dn}$ that can be used to construct
$(\varepsilon, \alpha)$-quantizations for several families of summarizing shapes. This added generality does come at an
increased cost in construction. In particular, an $(\varepsilon, \delta, \alpha)$-kernel of $\mu_P$ is a data structure such that in any
query direction $u \in \mathbb{S}^{d-1}$, with probability at least $1 - \delta$, we can create an $(\varepsilon, \alpha)$-quantization for the
cumulative density function of $\omega(\cdot, u)$, the width in direction $u$.

We follow the randomized framework described above as follows. The desired $(\varepsilon, \delta, \alpha)$-kernel $\mathcal{K}$ consists
of a set of $m = O((1/\varepsilon^2) \log(1/\delta))$ $(\alpha/2)$-kernels, $\{K_1, K_2, \ldots, K_m\}$, where each $K_j$ is an $(\alpha/2)$-kernel
of a point set $Q_j$ drawn randomly from $\mu_P$. Given $\mathcal{K}$, with probability at least $1 - \delta$, we can create an
$(\varepsilon, \alpha)$-quantization for the cumulative density function of width over $\mu_P$ in any direction $u \in \mathbb{S}^{d-1}$. Specifically,
let $M = \{\omega(K_j, u)\}_{j=1}^{m}$.

Lemma 3.3. With probability at least $1 - \delta$, $M$ is an $(\varepsilon, \alpha)$-quantization for the cumulative density function
of the width of $\mu_P$ in direction $u$.

Proof. The width $\omega(Q_j, u)$ of a random point set $Q_j$ drawn from $\mu_P$ is a random sample from the distribution
over widths of $\mu_P$ in direction $u$. Thus, with probability at least $1 - \delta$, $m$ such random samples would create
an $\varepsilon$-quantization. Using the width of the $\alpha$-kernels $K_j$ instead of $Q_j$ induces an error on each random sample
of at most $2\alpha \cdot \omega(Q_j, u)$. Then for a query width $w$, say there are $\gamma m$ point sets $Q_j$ that have width at most
$w$ and $\gamma' m$ $\alpha$-kernels $K_j$ with width at most $w$; see Figure 4. Note that $\gamma' \ge \gamma$. Let $\hat{w} = w - 2\alpha w$. For
each point set $Q_j$ that has width greater than $w$ but the corresponding $\alpha$-kernel $K_j$ has width at most $w$, it
follows that $K_j$ has width greater than $\hat{w}$. Thus the number of $\alpha$-kernels $K_j$ that have width at most $\hat{w}$ is at
most $\gamma m$, and thus there is a width $w'$ between $w$ and $\hat{w}$ such that the number of $\alpha$-kernels at most $w'$ is
exactly $\gamma m$. □

Fig. 4: $(\varepsilon, \alpha)$-quantization $M$ (white circles) and $\varepsilon$-quantization $R$ (black circles) given a query width $w$.

Since each $K_j$ can be computed in $O(n + 1/\alpha^{d-3/2})$ time, we obtain:

Theorem 3.4. We can construct an $(\varepsilon, \delta, \alpha)$-kernel for $\mu_P$ on $n$ points in $\mathbb{R}^d$ of size $O((1/\alpha^{(d-1)/2})(1/\varepsilon^2) \cdot
\log(1/\delta))$ in $O((n + 1/\alpha^{d-3/2}) \cdot (1/\varepsilon^2) \log(1/\delta))$ time.

The notion of $(\varepsilon, \alpha)$-quantizations and $(\varepsilon, \delta, \alpha)$-kernels can be extended to $k$-dimensional queries or for a
series of up to $k$ queries which all have approximation guarantees with probability $1 - \delta$.

Other coresets. In a similar fashion, coresets of a point set distribution $\mu_P$ can be formed using coresets for
other problems on discrete point sets. For instance, sample $m = O((1/\varepsilon^2) \log(1/\delta))$ point sets $\{P_1, \ldots, P_m\}$
each from $\mu_P$ and then store $\alpha$-samples $\{Q_1 \subset P_1, \ldots, Q_m \subset P_m\}$ of each. When we use random sampling
in the second set, then not all distributions $\mu_{p_i}$ need to be sampled for each $P_j$ in the first round. This
results in an $(\varepsilon, \delta, \alpha)$-sample of $\mu_P$, and can, for example, be used to construct (with probability $1 - \delta$) an
$(\varepsilon, \alpha)$-quantization for the fraction of points expected to fall in a query disk. Similar constructions can be
done for other coresets, such as $\varepsilon$-nets [30], $k$-center [5], or smallest enclosing ball [TT].

3.2 Measuring the Error 

We have established asymptotic bounds of $m = O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ random samples for constructing
$\varepsilon$-quantizations. Now we empirically demonstrate that the constant hidden by the big-O notation is
approximately 0.5, indicating that these algorithms are indeed quite practical.

As a data set, we consider a set of $n = 50$ sample points in $\mathbb{R}^3$ chosen randomly from the boundary of
a cylinder piece of length 10 and radius 1. We let each point represent the center of a 3-variate Gaussian
distribution with standard deviation 2 to represent the probability distribution of an uncertain point. This
set of distributions describes an uncertain point set $\mu_P : \mathbb{R}^{3n} \to \mathbb{R}^+$.

We want to estimate three statistics on $\mu_P$: dwid, the width of the point set in a direction that makes an
angle of 75° with the cylinder axis; diam, the diameter of the point set; and seb2, the radius of the smallest
enclosing ball (using code from Bernd Gartner [25]). We can create $\varepsilon$-quantizations with $m$ samples from $\mu_P$,
where the value of $m$ is from the set $\{16, 64, 256, 1024, 4096\}$.

We would like to evaluate the $\varepsilon$-quantizations versus the ground truth function $F_{\mu_P}$; however, it is not
clear how to evaluate $F_{\mu_P}$. Instead, we create another $\varepsilon$-quantization $Q$ with $\eta = 100000$ samples from $\mu_P$,
and treat this as if it were the ground truth. To evaluate each sample $\varepsilon$-quantization $R$ versus $Q$ we find
the maximum deviation (i.e. $d_\infty(R, Q) = \max_{q \in \mathbb{R}} |h_R(q) - h_Q(q)|$) with $h$ defined with respect to diam or
dwid. This can be done by evaluating, for each value $r \in R$, $|h_R(r) - h_Q(r)|$ and $|(h_R(r) - 1/|R|) - h_Q(r)|$,
and returning the maximum of both values over all $r \in R$.
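The evaluation procedure just described can be sketched literally as follows, for univariate quantizations stored as lists of numbers. The function name is illustrative. Note that the second term compares the value of $h_R$ just before its jump at $r$ against $h_Q(r)$, which treats $h_Q$ as nearly continuous; this is reasonable when $Q$ is much larger than $R$.

```python
import bisect

def max_deviation(R, Q):
    """Maximum deviation between h_R and h_Q, checked at each value r in R."""
    R_sorted, Q_sorted = sorted(R), sorted(Q)
    step = 1.0 / len(R_sorted)
    worst = 0.0
    for r in R_sorted:
        h_R = bisect.bisect_right(R_sorted, r) / len(R_sorted)  # fraction of R <= r
        h_Q = bisect.bisect_right(Q_sorted, r) / len(Q_sorted)  # fraction of Q <= r
        worst = max(worst, abs(h_R - h_Q), abs((h_R - step) - h_Q))
    return worst
```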

Given a fixed "ground truth" quantization $Q$ we repeat this process for $r = 500$ trials of $R$, each returning
a $d_\infty(R, Q)$ value. The set of these $r$ maximum deviation values results in another quantization $S$ for each of
diam and dwid, plotted in Figure 5. Intuitively, the maximum deviation quantization $S$ describes the sample
probability that $d_\infty(R, Q)$ will be less than some query value.







Fig. 5: Shows quantizations of $r = 500$ trials for $d_\infty(R, Q)$ where $Q$ and $R$ measure dwid and diam. The size
of each $R$ is $m = \{16, 64, 256, 1024, 4096\}$ (from right to left) and the "ground truth" quantization $Q$
has size $\eta = 100000$. Smooth, thick curves are $1 - \delta = 1 - \exp(-2m\varepsilon^2 + 1)$ where $\varepsilon = d_\infty(R, Q)$.

Note that the maximum deviation quantizations S are similar for both statistics (and others we tried),
and thus we can use these plots to estimate $1 - \delta$, the sample probability that $d_\infty(R, Q) < \varepsilon$, given a value
m. We can fit this function as approximately $1 - \delta = 1 - \exp(-m\varepsilon^2/C + \nu)$ with C = 0.5 and $\nu = 1.0$. Thus
solving for m in terms of $\varepsilon$, $\nu$, and $\delta$ reveals: $m = C(1/\varepsilon^2)(\nu + \log(1/\delta))$. This indicates the big-O notation
for the asymptotic bound of $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ [Mj for $\varepsilon$-samples only hides a constant of approximately
0.5.
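As a worked illustration of the fitted relation, the following sketch evaluates it in both directions. The function names are hypothetical; the constants C = 0.5 and $\nu$ = 1.0 are the fitted values quoted above, and the formula is an empirical fit rather than a proven bound.

```python
from math import exp, log

C, NU = 0.5, 1.0  # constants fitted to the experiments in the text


def confidence(m, eps, c=C, nu=NU):
    """Fitted estimate of 1 - delta for m samples and error eps, clamped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - exp(-m * eps ** 2 / c + nu)))


def samples_needed(eps, delta, c=C, nu=NU):
    """Solve the fitted relation for m: m = C (1/eps^2) (nu + log(1/delta))."""
    return c * (1.0 / eps ** 2) * (nu + log(1.0 / delta))
```

For example, `samples_needed(0.05, 0.1)` evaluates to roughly 660 samples, and feeding that value back into `confidence` recovers 0.9.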

We also ran these experiments on k-variate quantizations by considering the width in k different directions.
As expected, the quantizations for maximum deviation can be fit with an equation $1 - \delta = 1 - \exp(-m\varepsilon^2/C + k)$
with C = 0.5, so $m \le C(1/\varepsilon^2)(k + \log(1/\delta))$. For k > 2, this bound for m becomes too conservative; even
fewer samples were needed.

4 Deterministic Computations on Indecisive Point Sets 

In this section, we take as input a set of n indecisive points, and describe deterministic exact algorithms for 
creating quantizations of classes of functions on this input. We characterize problems when these deterministic 
algorithms can or can not be made efficient. 

4.1 Polynomial Time Algorithms 

We are interested in the distribution of the value f(Q) for each support $Q \Subset \mathcal{P}$. Since there are $k^n$ possible
supports, in general we cannot hope to do anything faster than that without making additional assumptions
about f. Define $f(\mathcal{P}, r)$ as the fraction (measured by weight) of supports of $\mathcal{P}$ for which f gives a value
smaller than or equal to r. In this version, for simplicity, we assume general position and that $k^n$ can be
described by O(1) words (handled otherwise in Appendix C). First, we will let f(Q) denote the radius of the
smallest enclosing disk of Q in the plane, and show how to solve the decision problem in polynomial time in
that case. We then show how to generalize the ideas to other classes of measures.

Smallest enclosing disk. Consider the problem where f measures the radius of the smallest enclosing disk of
a support and let all weights be uniform so $w(q_{i,j}) = 1$ for all i and j. Evaluating $f(\mathcal{P}, r)$ in time polynomial
in n and k is not completely trivial since there are $k^n$ possible supports. However, we can make use of the
fact that each smallest enclosing disk is in fact defined by a set of at most 3 points that lie on the boundary
of the disk. For each support $Q \Subset \mathcal{P}$ we define $B_Q \subseteq Q$ to be this set of at most 3 points, which we call the
basis for Q. Bases have the property that $f(Q) = f(B_Q)$.

Fig. 6: (a) The smallest enclosing circle of a set of points is defined by two or three points on the boundary. (b) This
circle contains one purple (dark) point, four blue (medium) points, and two yellow (light) points. Hence there are
1 × 4 × 2 = 8 samples that have this basis.

Fig. 7: (a) Example input with n = 3 and k = 3. (b) One possible basis, consisting of 3 points. This basis has one
support: the basis itself. (c) Another possible basis, consisting of 2 points. This basis has three supports. (d)
The graph showing for each diameter d how many supports do not exceed that diameter. This corresponds to the
cumulative distribution of the radius of the smallest enclosing disk of these points.

Now, to avoid having to test an exponential number of supports, we define a potential basis to be a set of
at most 3 points in $\mathcal{P}$ such that each point is from a different $P_i$. Clearly, there are at most $(nk)^3$ possible
potential bases, and each support $Q \Subset \mathcal{P}$ has one as its basis. Now, we only need to count for each potential
basis the number of supports it represents. Counting the number of samples that have a certain basis is
easy for the smallest enclosing circle. Given a basis B, we count for each indecisive point P that does not
contribute a point to B itself how many of its members lie inside the smallest enclosing circle of B, and then
we multiply these numbers. Figure 7 illustrates the idea.

Now, for each potential basis B we have two values: the number of supports that have B as their basis,
and the value f(B). We can sort these $O((nk)^3)$ pairs on the value of f, and the result provides us with the
required distribution. We spend O(nk) time per potential basis for counting the points inside and O(n) time
for multiplying these values, so combined with $O((nk)^3)$ potential bases this gives $O((nk)^4)$ total time.
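The counting procedure above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the names `disk_of` and `sed_distribution` are hypothetical, weights are uniform, and the input is assumed to be in general position (no coincident points, and no non-basis point exactly on a basis circle), so that every support has a unique basis.

```python
from itertools import combinations, product
from math import dist, prod


def disk_of(points):
    """Return (center, radius) of the disk defined by 1-3 points, or None if they are not a basis."""
    if len(points) == 1:
        return points[0], 0.0
    if len(points) == 2:
        (ax, ay), (bx, by) = points
        return ((ax + bx) / 2, (ay + by) / 2), dist(points[0], points[1]) / 2
    (ax, ay), (bx, by), (cx, cy) = points
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        return None  # collinear points have no circumcircle
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    center = ((a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d,
              (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d)
    # three points form a basis only if no pair's diametral disk already covers the third
    for p, q, o in ((0, 1, 2), (0, 2, 1), (1, 2, 0)):
        c, r = disk_of([points[p], points[q]])
        if dist(c, points[o]) <= r:
            return None
    return center, dist(center, points[0])


def sed_distribution(P):
    """P is a list of n lists of points. Return sorted (radius, number of supports) pairs."""
    n, pairs = len(P), []
    for size in (1, 2, 3):
        for idx in combinations(range(n), size):        # which indecisive points contribute
            for B in product(*(P[i] for i in idx)):     # which location each one contributes
                disk = disk_of(list(B))
                if disk is None:
                    continue
                c, r = disk
                # for every other indecisive point, count its locations inside the disk
                count = prod(sum(dist(c, p) <= r for p in P[i])
                             for i in range(n) if i not in idx)
                if count:
                    pairs.append((r, count))
    return sorted(pairs)
```

Under the general-position assumption the counts sum to the total number of supports, which gives a quick sanity check on small inputs.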

Theorem 4.1. Let $\mathcal{P}$ be a set of n sets of k points. In $O((nk)^4)$ time, we can compute a data structure
of $O((nk)^3)$ size that can tell us in $O(\log(nk))$ time for any value r how many supports $Q \Subset \mathcal{P}$ satisfy
$f(Q) \le r$.

LP-type problems. The approach described above also works for measures $f : \mathcal{P} \to \mathbb{R}$ other than the
smallest enclosing disk. In particular, it works for LP-type problems [52] that have constant combinatorial
dimension. An LP-type problem provides a set of constraints H and a function $\omega : 2^H \to \mathbb{R}$ with the following
two properties:

Monotonicity: For any $F \subseteq G \subseteq H$, $\omega(F) \le \omega(G)$.

Locality: For any $F \subseteq G \subseteq H$ with $\omega(F) = \omega(G)$ and any $h \in H$, $\omega(G \cup h) > \omega(G)$
implies that $\omega(F \cup h) > \omega(F)$.

A basis for an LP-type problem is a subset $B \subseteq H$ such that $\omega(B') < \omega(B)$ for all proper subsets $B'$ of B.
And we say that B is a basis for a subset $G \subseteq H$ if $B \subseteq G$, $\omega(B) = \omega(G)$ and B is a basis. A constraint
$h \in H$ violates a basis B if $\omega(B \cup h) > \omega(B)$. The radius of the smallest enclosing ball is an LP-type problem
(where the points are the constraints and $\omega(\cdot) = f(\cdot)$) as are linear programming and many other geometric
problems. Let the maximum cardinality of any basis be the combinatorial dimension of a problem.

For our algorithm to run efficiently, we assume that our LP-type problem has available the following
algorithmic primitive, which is often assumed for LP-type problems with constant combinatorial dimension [52].
For a subset $G \subseteq H$ where B is known to be the basis of G and a constraint $h \in H$, a violation test determines
in O(1) time if $\omega(B \cup h) > \omega(B)$; i.e., if h violates B. More specifically, given an efficient violation test, we can
ensure a stronger algorithmic primitive. A full violation test is given a subset $G \subseteq H$ with known basis B and
a constraint $h \in H$ and determines in O(1) time if $\omega(B) < \omega(G \cup h)$. This follows because we can test in O(1)
time if $\omega(B) < \omega(B \cup h)$: monotonicity implies that $\omega(B) < \omega(B \cup h)$ only if $\omega(B) < \omega(B \cup h) \le \omega(G \cup h)$,
and locality implies that $\omega(B) = \omega(B \cup h)$ only if $\omega(B) = \omega(G) = \omega(G \cup h)$. Thus we can test if h violates
G by considering just B and h, but if either monotonicity or locality fails for our problem we cannot.

We now adapt our algorithm to LP-type problems where elements of each $P_i$ are potential constraints
and the ranking function is f. When the combinatorial dimension is a constant $\beta$, we need to consider only
$O((nk)^\beta)$ bases, which will describe all possible supports.

The full violation test implies that given a basis B, we can measure the sum of probabilities of all supports
of $\mathcal{P}$ that have B as their basis in O(nk) time. For each indecisive point P such that $B \cap P = \emptyset$, we sum
the probabilities of all elements of P that do not violate B. The product of these probabilities, times the
product of the probabilities of the elements in the basis, gives the probability of B being the true basis. See
Algorithm 4.1, where the indicator function $\mathbf{1}(f(B \cup \{p_j\}) = f(B))$ returns 1 if $p_j$ does not violate B
and 0 otherwise. It runs in $O((nk)^{\beta+1})$ time.



Algorithm 4.1 Construct Probability Distribution for $f(\mathcal{P})$.
for all potential bases $B \subset Q \Subset \mathcal{P}$ do
    for i = 1 to n do
        if there is a j such that $p_{ij} \in B$ then
            Set $w_i = w(p_{ij})$.
        else
            Set $w_i = \sum_{j=1}^{k} w(p_{ij}) \, \mathbf{1}(f(B \cup \{p_{ij}\}) = f(B))$.
    Store a point with value f(B) and weight $(1/k^n) \prod_i w_i$.
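Algorithm 4.1 can be sketched generically in Python as below. This is a hedged illustration rather than a definitive implementation: the names `lp_type_distribution` and `interval_length` are hypothetical, the weights are treated directly as probabilities (so the $1/k^n$ normalization is absorbed into them), general position is assumed, and the violation test is performed by re-evaluating f on $B \cup \{p\}$ instead of in O(1) time. The example measure is the length of the smallest enclosing interval of points on a line, which has combinatorial dimension 2.

```python
from itertools import combinations, product
from math import prod


def lp_type_distribution(P, W, f, beta):
    """Return sorted (value, probability) pairs, one per basis with positive probability.

    P[i][j] is the j-th location of indecisive point i and W[i][j] its probability;
    f maps a list of points to a number; beta is the combinatorial dimension.
    """
    n, pairs = len(P), []
    for size in range(1, min(beta, n) + 1):
        for idx in combinations(range(n), size):
            for choice in product(*(range(len(P[i])) for i in idx)):
                B = [P[i][j] for i, j in zip(idx, choice)]
                value = f(B)
                # B is a basis only if dropping any single element lowers the value
                if size > 1 and any(f(B[:t] + B[t + 1:]) >= value for t in range(size)):
                    continue
                weight = prod(W[i][j] for i, j in zip(idx, choice))
                for i in range(n):
                    if i not in idx:
                        # total probability of the locations of point i that do not violate B
                        weight *= sum(w for p, w in zip(P[i], W[i]) if f(B + [p]) == value)
                if weight > 0:
                    pairs.append((value, weight))
    return sorted(pairs)


def interval_length(points):
    """Length of the smallest interval enclosing points on a line (an LP-type measure)."""
    return max(points) - min(points)
```

With three indecisive points on a line, `lp_type_distribution([[0, 4], [1, 6], [2, 3]], [[0.5, 0.5]] * 3, interval_length, 2)` yields pairs whose probabilities sum to 1 and agree with enumerating all eight supports by hand.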



As with the special case of smallest enclosing disk, we can create a distribution over the values of f given
an indecisive point set $\mathcal{P}$. For each basis B we calculate $\mu(B)$, the summed probability of all supports that
have basis B, and f(B). We can then sort these pairs according to the value of f again. For any query value
r, we can retrieve $f(\mathcal{P}, r)$ in $O(\log(nk))$ time and it takes O(n) time to describe (because of its long length).
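The sorted pairs support logarithmic-time queries once prefix sums are stored. The sketch below assumes the pairs are (value, probability) tuples as described above; the names `build_cdf` and `query` are hypothetical, and floating-point probabilities stand in for the exact long numbers the text mentions.

```python
import bisect
from itertools import accumulate


def build_cdf(pairs):
    """Sort (value, probability) pairs and store running totals of the probabilities."""
    pairs = sorted(pairs)
    values = [v for v, _ in pairs]
    cumulative = list(accumulate(w for _, w in pairs))
    return values, cumulative


def query(cdf, r):
    """Return the total probability of supports whose value is at most r."""
    values, cumulative = cdf
    i = bisect.bisect_right(values, r)  # number of stored values <= r
    return cumulative[i - 1] if i > 0 else 0.0
```

Building the structure costs one sort; each query is a single binary search.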

Theorem 4.2. Given a set $\mathcal{P}$ of n indecisive point sets of size k each, and given an LP-type problem
$f : \mathcal{P} \to \mathbb{R}$ with combinatorial dimension $\beta$, we can create the distribution of f over $\mathcal{P}$ in $O((nk)^{\beta+1})$ time.
The size of the distribution is $O(n(nk)^\beta)$.

If we assume general position of $\mathcal{P}$ relative to f, then we can often slightly improve the runtime needed to
calculate $\mu(B)$ using range searching data structures. However, to deal with degeneracies, we may need to
spend O(nk) time per basis, regardless.

If we are content with an approximation of the distribution rather than an exact representation, then it is
often possible to drastically reduce the storage and runtime following techniques discussed in Section 3.

Measures that fit in this framework for points in $\mathbb{R}^d$ include smallest enclosing axis-aligned rectangle
(measured either by area or perimeter) ($\beta = 2d$), smallest enclosing ball in the $L_1$, $L_2$, or $L_\infty$ metric
($\beta = d + 1$), directional width of a set of points ($\beta = 2$), and, after dualizing, linear programming ($\beta = d$).
These approaches also carry over naturally to deterministically create polynomial-sized k-variate quantizations.
4.2 Hardness Results 

In this section, we examine some extent measures that do not fit in the above framework. First, diameter 
does not satisfy the locality property, and hence we cannot efficiently perform the full violation test. We 
show that a decision variant of diameter is #P-hard, even in the plane, and thus (under the assumption
that #P $\ne$ P), there is no polynomial time solution. This result holds despite the fact that diameter has a
combinatorial dimension of 2, implying that the associated quantization has at most $O((nk)^2)$ steps. Second,
the area of the convex hull does not have a constant combinatorial dimension, thus we can show the resulting 
distribution may have exponential size. 

Diameter. The diameter of a set of points in the plane is the largest distance between any two points. We 
will show that the counting problem of computing $f(\mathcal{P}, r)$ is #P-hard when f denotes the diameter.

Problem 4.3. PLANAR-DIAM: Given a parameter d and a set $\mathcal{P} = \{P_1, \ldots, P_n\}$ of n sets, each consisting
of k points in the plane, how many supports $Q \Subset \mathcal{P}$ have $f(Q) \le d$?

We will now prove that Problem 4.3 is #P-hard. Our proof has three steps. We first show a special
version of #2SAT has a polynomial reduction from Monotone #2SAT, which is #P-complete [56]. Then, 
given an instance of this special version of #2SAT, we construct a graph with weighted edges on which the 
diameter problem is equivalent to this #2SAT instance. Finally, we show the graph can be embedded as a 
straight-line graph in the plane as an instance of PLANAR-DIAM. 

Let 3CLAUSE-#2SAT be the problem of counting the number of solutions to a 2SAT formula, where 
each variable occurs in at most three clauses, and each variable is either in exactly one clause or is negated in 
exactly one clause. Thus, each distinct literal appears in at most two clauses. 

Lemma 4.4. Monotone #2SAT has a polynomial reduction to 3CLAUSE-#2SAT. 

Proof. The Monotone #2SAT problem counts the number of satisfying assignments to a #2SAT instance where
each clause has at most two variables and no variables are negated. Let $X = \{(x, y_1), (x, y_2), \ldots, (x, y_u)\}$
be the set of u clauses which contain variable x in a Monotone #2SAT instance. We replace x with u
variables $\{z_1, z_2, \ldots, z_u\}$ and we replace X with the following 2u clauses: $\{(z_1, y_1), (z_2, y_2), \ldots, (z_u, y_u)\}$ and
$\{(z_1, \neg z_2), (z_2, \neg z_3), \ldots, (z_{u-1}, \neg z_u), (z_u, \neg z_1)\}$. The first set of clauses preserves the relation with other
original variables and the second set of clauses ensures that all of the new variables have the same value (i.e.
TRUE or FALSE). This procedure is repeated for each original variable that is in more than 1 clause. □
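The reduction in the proof can be checked on small instances with the following sketch. The function names and the naming scheme for the copies (`x_1`, `x_2`, ...) are hypothetical, clauses are assumed to contain two distinct variables, and the brute-force counter is exponential, so it is only meant for tiny formulas.

```python
from itertools import product


def reduce_monotone(clauses):
    """Rewrite a monotone 2SAT formula as in Lemma 4.4.

    clauses is a list of (x, y) variable-name pairs, all positive. The result is a
    list of clauses, each a pair of literals (name, is_positive).
    """
    occurrences = {}
    for c, clause in enumerate(clauses):
        for pos, v in enumerate(clause):
            occurrences.setdefault(v, []).append((c, pos))
    out = [[(v, True) for v in clause] for clause in clauses]
    for v, occ in occurrences.items():
        u = len(occ)
        if u < 2:
            continue  # variables in a single clause are left alone
        copies = [f"{v}_{t}" for t in range(1, u + 1)]
        for z, (c, pos) in zip(copies, occ):
            out[c][pos] = (z, True)  # each occurrence gets its own copy
        for t in range(u):
            # cycle of clauses (z_t or not z_{t+1}) forcing all copies to be equal
            out.append([(copies[t], True), (copies[(t + 1) % u], False)])
    return [tuple(clause) for clause in out]


def count_solutions(clauses):
    """Count satisfying assignments of a list of literal clauses by brute force."""
    names = sorted({name for clause in clauses for name, _ in clause})
    total = 0
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(any(assignment[name] == positive for name, positive in clause)
               for clause in clauses):
            total += 1
    return total
```

For the formula (a ∨ b) ∧ (a ∨ c) ∧ (b ∨ c), both the original and the reduced instance have four satisfying assignments, and no literal of the reduced instance appears in more than two clauses.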

We convert this problem into a graph problem by, for each variable $x_i$, creating a set $P_i = \{p_i^+, p_i^-\}$ of two
points. Let $S = \bigcup_i P_i$. Truth assignments of variables correspond to a support as follows. If $x_i$ is set TRUE,
then the support includes $p_i^+$, otherwise the support includes $p_i^-$. We define a distance function f between
points, so that the distance is greater than d (long) if the corresponding literals are in a clause, and less than
d (short) otherwise. If we consider the graph formed by only long edges, we make two observations. First, the 
maximum degree is 2, since each literal is in at most two clauses. Second, there are no cycles since a literal is 
only in two clauses if in one clause the other variable is negated, and negated variables are in only one clause. 
These two properties imply we can use the following construction to show that the PLANAR-DIAM problem 
is as hard as counting Monotone #2SAT solutions, which is #P-complete. 

Lemma 4.5. An instance of PLANAR-DIAM reduced from 3CLAUSE-#2SAT can be embedded so $\mathcal{P} \subset \mathbb{R}^2$.

Proof. Consider an instance of 3CLAUSE-#2SAT where there are n variables, and thus the corresponding
graph has n sets $\{P_i\}_{i=1}^n$. We construct a sequence $\Gamma$ of $n' \in [2n, 4n]$ points. It contains all points from $\mathcal{P}$
and a set of at most as many dummy points. First organize a sequence $\Gamma'$ so if two points q and p have a long
edge, then they are consecutive. Now for any pair of consecutive points in $\Gamma'$ which do not have a long edge,
insert a dummy point between them to form the sequence $\Gamma$. Also place a dummy point at the end of $\Gamma$.

Fig. 8: Embedded points are solid, at center of circles of radius d. Dummy points hollow. Long edges are drawn between
points at distance greater than d.

We place all points on a circle C of diameter $d / \cos(\pi/n')$, see Figure 8. We first place all points on a
semicircle of C according to the order of $\Gamma$, so consecutive points are $\pi/n'$ radians apart. Then for every
other point (i.e. the points with an even index in the ordering $\Gamma$) we replace it with its antipodal point on C,
so no two points are within $2\pi/n'$ radians of each other. Finally we remove all dummy points. This completes
the embedding of $\mathcal{P}$; we now need to show that only points with long edges are further than d apart.

We can now argue that only vertices which were consecutive in $\Gamma$ are further than d apart; the remainder
are closer than d. Consider a vertex p and a circle $C_p$ of radius d centered at p. Let $p'$ be the antipodal
point of p on C. $C_p$ intersects C at two points, at $2\pi/n'$ radians in either direction from $p'$. Thus only points
within $2\pi/n'$ radians of $p'$ are further than a distance d from p. This set includes only those points which are
adjacent to p in $\Gamma$, which can only include points which should have a long edge, by construction. □
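The embedding in the proof can be sketched numerically as follows. The function names are hypothetical, dummy points are represented by `None`, and the indexing is 0-based. Because points exactly $2\pi/n'$ away from an antipode lie at distance exactly d, a floating-point check needs a small tolerance; the tolerance used here is an assumption of this sketch.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def embed_on_circle(sequence, d):
    """Place a sequence of labels (None marks a dummy) on a circle of diameter d / cos(pi / n').

    Returns a dict mapping each real label to its (x, y) coordinates.
    """
    n_prime = len(sequence)
    radius = d / cos(pi / n_prime) / 2
    coords = {}
    for i, label in enumerate(sequence):
        angle = i * pi / n_prime      # consecutive points are pi / n' apart on a semicircle
        if i % 2 == 0:
            angle += pi               # every other point moves to its antipode
        if label is not None:         # dummy points are dropped
            coords[label] = (radius * cos(angle), radius * sin(angle))
    return coords


def long_pairs(coords, d, tol=1e-9):
    """Return the set of label pairs that end up further than d apart."""
    return {frozenset(pair) for pair in combinations(coords, 2)
            if dist(coords[pair[0]], coords[pair[1]]) > d + tol}
```

For the sequence a, b, dummy, c, d, e, dummy, the pairs further than d apart are exactly {a, b}, {c, d} and {d, e}, which are the consecutive real points.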



Combining Lemmas 4.4 and 4.5 gives the following theorem.



Theorem 4.6. PLANAR-DIAM is #P-hard. 



Convex hull. Our LP-type framework also does not work for any properties of the convex hull (e.g. area or
perimeter) because it does not have constant combinatorial dimension; a basis could have size n. In fact, the
complexity of the distribution describing the convex hull may be $\Omega(k^n)$, since if all points in $\mathcal{P}$ lie on or near
a circle, then every support $Q \Subset \mathcal{P}$ may be its own basis of size n, and have a different value f(Q).



5 Deterministic Algorithms for Approximate Computations on Uncertain Points 

In this section we show how to approximately answer questions about most representations of independent
uncertain points; in particular, we handle representations that have almost all (a $1 - \varepsilon$ fraction) of their mass
with bounded support in $\mathbb{R}^d$ and are described in a compact manner (see Appendix A). Specifically, in this
section, we are given a set $\mathcal{P} = \{P_1, P_2, P_3, \ldots, P_n\}$ of n independent random variables over the universe
$\mathbb{R}^d$, together with a set $\mu_{\mathcal{P}} = \{\mu_1, \mu_2, \mu_3, \ldots, \mu_n\}$ of n probability distributions that govern the variables,
that is, $P_i \sim \mu_i$. Again, we call a set of points $Q = \{q_1, q_2, q_3, \ldots, q_n\}$ a support of $\mathcal{P}$, and because of the
independence we have probability $\Pr[\mathcal{P} = Q] = \prod_i \Pr[P_i = q_i]$.

The main strategy will be to replace each distribution $\mu_i$ by a discrete point set $P_i$, such that the uniform
distribution over $P_i$ is "not too far" from $\mu_i$ ($P_i$ is not the most obvious $\varepsilon$-sample of $\mu_i$). Then we apply
the algorithms from Section 4 to the resulting set of point sets. Finally, we argue that the result is in fact
an $\varepsilon$-quantization of the distribution we are interested in. Using results from Section 3 we can simplify the
output in order to decrease the space complexity for the data structure, without increasing the approximation
factor too much.



General approach. Given a distribution $\mu_i : \mathbb{R}^2 \to \mathbb{R}^+$ describing uncertain point $P_i$ and a function f of
bounded combinatorial dimension $\beta$ defined on a support of $\mathcal{P}$, we can describe a straightforward range space
$T_i = (\mu_i, \mathcal{A}_f)$, where $\mathcal{A}_f$ is the set of ranges corresponding to the bases of f (e.g., when f measures the radius
of the smallest enclosing ball, $\mathcal{A}_f$ would be the set of all balls). More formally, $\mathcal{A}_f$ is the set of subsets of $\mathbb{R}^d$
defined as follows: for every set of $\beta$ points which define a basis B for f, $\mathcal{A}_f$ contains a range A that contains
all points p such that $f(B) = f(B \cup \{p\})$. However, taking $\varepsilon$-samples from each $T_i$ is not sufficient to create
sets $Q_i$ such that $Q = \{Q_1, Q_2, \ldots, Q_n\}$ so for all r we have $|f(\mathcal{P}, r) - f(Q, r)| \le \varepsilon$.

$f(\mathcal{P}, r)$ is a complicated joint probability depending on the n distributions and f, and the n straightforward
$\varepsilon$-samples do not contain enough information to decompose this joint probability. The required $\varepsilon$-sample
of each $\mu_i$ should model $\mu_i$ in relation to f and any instantiated point $q_j$ representing $\mu_j$ for $i \ne j$. The
following crucial definition allows for the range space to depend on any $n - 1$ points, including the possible
locations of each uncertain point.

Let $\mathcal{A}_{f,n}$ describe a family of Lebesgue-measurable sets defined by $n - 1$ points $Z \subset \mathbb{R}^d$ and a value w.
Specifically, $A(Z, w) \in \mathcal{A}_{f,n}$ is the set of points $\{p \in \mathbb{R}^d \mid f(Z \cup p) \le w\}$. We describe examples of $\mathcal{A}_{f,n}$ in
detail shortly, but first we state the key theorem using this definition. Its proof, delayed until after examples
of $\mathcal{A}_{f,n}$, will make clear how $(\mu_i, \mathcal{A}_{f,n})$ exactly encapsulates the right guarantees to approximate $f(\mathcal{P}, r)$, and
thus why $(\mu_i, \mathcal{A}_f)$ does not.

Theorem 5.1. Let $\mathcal{P} = \{P_1, \ldots, P_n\}$ be a set of uncertain points where each $P_i \sim \mu_i$. For a function f, let
$Q_i$ be an $\varepsilon'$-sample of $(\mu_i, \mathcal{A}_{f,n})$ and let $Q = \{Q_1, \ldots, Q_n\}$. Then for any r, $|f(\mathcal{P}, r) - f(Q, r)| \le \varepsilon' n$.



Smallest axis-aligned bounding box by perimeter. Given a set of points $P \subset \mathbb{R}^2$, let f(P) represent the
perimeter of the smallest axis-aligned box that contains P. Let each $\mu_i$ be a bivariate normal distribution
with constant variance. Solving f(P) is an LP-type problem with combinatorial dimension $\beta = 4$, and as
such, we can describe the basis B of a set P as the points with minimum and maximum x- and y-coordinates.
Given any additional point p, the perimeter of size $\rho$ can only be increased to a value w by expanding the
range of x-coordinates, y-coordinates, or both. As such, the region of $\mathbb{R}^2$ described by a range $A(P, w) \in \mathcal{A}_{f,n}$
is defined with respect to the bounding box of P from an edge increasing the x-width or y-width by $(w - \rho)/2$,
or from a corner extending so the sum of the x and y deviation is $(w - \rho)/2$. See Figure 9(a).
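To make the shape of such a range concrete, the sketch below tests membership of a point in A(P, w) in two ways: directly from the definition, and from the geometric description in terms of the slack $(w - \rho)/2$. The function names are hypothetical and the code is only an illustration of the description above.

```python
def perimeter(points):
    """Perimeter of the smallest axis-aligned box containing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return 2.0 * ((max(xs) - min(xs)) + (max(ys) - min(ys)))


def in_range_direct(P, w, p):
    """Definition: p is in A(P, w) if adding p keeps the perimeter at most w."""
    return perimeter(P + [p]) <= w


def in_range_by_shape(P, w, p):
    """Geometric form: the x and y overshoot beyond the box of P sum to at most (w - rho) / 2."""
    xs = [x for x, _ in P]
    ys = [y for _, y in P]
    slack = (w - perimeter(P)) / 2.0
    dx = max(min(xs) - p[0], 0.0, p[0] - max(xs))  # how far p sticks out horizontally
    dy = max(min(ys) - p[1], 0.0, p[1] - max(ys))  # how far p sticks out vertically
    return dx + dy <= slack
```

With P = {(0, 0), (2, 1)} and w = 8 the slack is 1, so (3, 1) and (2.5, 1.5) lie in the range while (3, 2) does not; the resulting region is the octagon formed by slabs at 0°, 45°, 90° and 135°.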

Since any such shape defining a range $A(P, w) \in \mathcal{A}_{f,n}$ can be described as the intersection of k = 4
slabs along fixed axes (at 0°, 45°, 90°, and 135°), we can construct an $(\varepsilon/n)$-sample $Q_i$ of $(\mu_i, \mathcal{A}_{f,n})$ of
size $k = O((n/\varepsilon) \log^8(n/\varepsilon))$ in $O((n^6/\varepsilon^6) \log^{27}(n/\varepsilon))$ time [38]. From Theorem 5.1 it follows that for
$Q = \{Q_1, \ldots, Q_n\}$ and any r we have $|f(\mathcal{P}, r) - f(Q, r)| \le \varepsilon$.

We can then apply Theorem 4.2 to build an $\varepsilon$-quantization of $f(\mathcal{P})$ in $O((nk)^5) = O(((n^2/\varepsilon) \log^8(n/\varepsilon))^5)$
$= O((n^{10}/\varepsilon^5) \log^{40}(n/\varepsilon))$ time. The size can be reduced to $O(1/\varepsilon)$ within that time bound.

Corollary 5.2. Let $\mathcal{P} = \{P_1, \ldots, P_n\}$ be a set of indecisive points where each $P_i \sim \mu_i$ is bivariate normal
with constant variance. Let f measure the perimeter of the smallest enclosing axis-aligned bounding box. We
can create an $\varepsilon$-quantization of $f(\mathcal{P})$ in $O((n^{10}/\varepsilon^5) \log^{40}(n/\varepsilon))$ time of size $O(1/\varepsilon)$.



Smallest enclosing disk. Given a set of points $P \subset \mathbb{R}^2$, let f(P) represent the radius of the smallest
enclosing disk of P. Let each $\mu_i$ be a bivariate normal distribution with constant variance. Solving f(P) is
an LP-type problem with combinatorial dimension $\beta = 3$, and the basis B of P generically consists of either
3 points which lie on the boundary of the smallest enclosing disk, or 2 points which are antipodal on the
smallest enclosing disk. However, given an additional point $p \in \mathbb{R}^2$, the new basis $B_p$ is either B or it is p
along with 1 or 2 points which lie on the convex hull of P.

Fig. 9: (a) A shape from $\mathcal{A}_{f,n}$ for axis-aligned bounding box, measured by perimeter. (b) A shape from $\mathcal{A}_{f,n}$ for smallest
enclosing ball using the $L_2$ metric in $\mathbb{R}^2$. The curves are circular arcs of two different radii. (c) The same shape
divided into wedges from $\mathcal{W}_{f,n}$.



We can start by examining all pairs of points $p_i, p_j \in P$ and the two disks of radius w whose boundary
circles pass through them. If one such disk $D_{i,j}$ contains P, then $D_{i,j} \subseteq A(P, w) \in \mathcal{A}_{f,|P|+1}$. For this to
hold, $p_i$ and $p_j$ must lie on the convex hull of P and no point that lies between them on the convex hull can
contribute to such a disk. Thus there are O(n) such disks. We also need to examine the disks created where
p and one other point $p_i \in P$ are antipodal. The boundary of the union of all such disks which contain P is
described by part of a circle of radius 2w centered at some $p_i \in P$. Again, for such a disk $D_i$ to describe a
part of the boundary of A(P, w), the point $p_i$ must lie on the convex hull of P. The circular arc defining this
boundary will only connect two disks $D_{i,j}$ and $D_{k,i}$ because it will intersect with the boundary of $D_j$ and $D_k$
within these disks, respectively. An example of A(P, w) is shown in Figure 9(b).

Unfortunately, the range space $(\mathbb{R}^2, \mathcal{A}_{f,n})$ has VC-dimension O(n); it has O(n) circular boundary arcs. So,
creating an $\varepsilon$-sample of $T_i = (\mu_i, \mathcal{A}_{f,n})$ would take time exponential in n. However, we can decompose any
range $A(P, w) \in \mathcal{A}_{f,n}$ into at most 2n "wedges." We choose one point y inside the convex hull of P. For each
circular arc on the boundary of A(P, w) we create a wedge by coning that boundary arc to y. Let $\mathcal{W}_f$ describe all
wedge shaped ranges. Then $S = (\mathbb{R}^2, \mathcal{W}_f)$ has VC-dimension $\nu_S$ at most 9 since it is the intersection of 3 ranges
(two halfspaces and one disk) that can each have VC-dimension 3. We can then create $Q_i$, an $(\varepsilon/2n^2)$-sample
of $S_i = (\mu_i, \mathcal{W}_f)$, of size $k = O((n^4/\varepsilon^2) \log(n/\varepsilon))$ in $O((n^2/\varepsilon)^{5 + 2 \cdot 9} \log^{2+9}(n/\varepsilon)) = O((n^{46}/\varepsilon^{23}) \log^{11}(n/\varepsilon))$
time, via Corollary A.2 (Appendix A). It follows that $Q_i$ is an $(\varepsilon/n)$-sample of $T_i = (\mu_i, \mathcal{A}_{f,n})$, since any
range $A(Z, w) \in \mathcal{A}_{f,n}$ can be decomposed into at most 2n wedges, each of which has counting error at most
$\varepsilon/2n^2$, thus the total counting error is at most $2n \cdot \varepsilon/2n^2 = \varepsilon/n$.



Invoking Theorem 5.1, it follows that for $Q = \{Q_1, \ldots, Q_n\}$ and any r we have $|f(\mathcal{P}, r) - f(Q, r)| \le \varepsilon$. We
can then apply Theorem 4.2 to build an $\varepsilon$-quantization of $f(\mathcal{P})$ in $O((nk)^4) = O((n^{20}/\varepsilon^8) \log^4(n/\varepsilon))$ time.
This is dominated by the time for creating the n $(\varepsilon/n^2)$-samples, even though we only need to build one and
then translate and scale to the rest. Again, the size can be reduced to $O(1/\varepsilon)$ within that time bound.

Corollary 5.3. Let $\mathcal{P} = \{P_1, \ldots, P_n\}$ be a set of indecisive points where each $P_i \sim \mu_i$ is bivariate normal with
constant variance. Let f measure the radius of the smallest enclosing disk. We can create an $\varepsilon$-quantization
of $f(\mathcal{P})$ in $O((n^{46}/\varepsilon^{23}) \log^{11}(n/\varepsilon))$ time of size $O(1/\varepsilon)$.



Now that we have seen two concrete examples, we prove Theorem 5.1. More examples can be found in
Appendix B.



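Proof of Theorem 5.1. When each $P_i$ is drawn from a distribution $\mu_i$, then we can write $f(\mathcal{P}, r)$ as the
probability that $f(\mathcal{P}) \le r$ as follows. Let $\mathbf{1}(\cdot)$ be the indicator function, i.e., it is 1 when the condition is true
and 0 otherwise.

$$f(\mathcal{P}, r) = \int_{p_1} \mu_1(p_1) \cdots \int_{p_n} \mu_n(p_n) \, \mathbf{1}(f(\{p_1, p_2, \ldots, p_n\}) \le r) \; dp_n \, dp_{n-1} \cdots dp_1$$

Consider the innermost integral

$$\int_{p_n} \mu_n(p_n) \, \mathbf{1}(f(\{p_1, p_2, \ldots, p_n\}) \le r) \; dp_n,$$

where $\{p_1, p_2, \ldots, p_{n-1}\}$ are fixed. The indicator function $\mathbf{1}(\cdot) = 1$ is true when $f(\{p_1, p_2, \ldots, p_{n-1}, p_n\}) \le r$,
and hence $p_n$ is contained in a shape $A(\{p_1, \ldots, p_{n-1}\}, r) \in \mathcal{A}_{f,n}$. Thus if we have an $\varepsilon'$-sample $Q_n$ for
$(\mu_n, \mathcal{A}_{f,n})$, then we can guarantee that

$$\int_{p_n} \mu_n(p_n) \, \mathbf{1}(f(\{p_1, p_2, \ldots, p_n\}) \le r) \; dp_n \le \frac{1}{|Q_n|} \sum_{p_n \in Q_n} \mathbf{1}(f(\{p_1, p_2, \ldots, p_{n-1}, p_n\}) \le r) + \varepsilon'.$$

We can then move the $\varepsilon'$ outside and change the order of the integrals to write:

$$f(\mathcal{P}, r) \le \frac{1}{|Q_n|} \sum_{p_n \in Q_n} \left( \int_{p_1} \mu_1(p_1) \cdots \int_{p_{n-1}} \mu_{n-1}(p_{n-1}) \, \mathbf{1}(f(\{p_1, \ldots, p_n\}) \le r) \; dp_{n-1} \cdots dp_1 \right) + \varepsilon'.$$

Repeating this procedure n times we get:

$$f(\mathcal{P}, r) \le \left( \prod_{i=1}^{n} \frac{1}{|Q_i|} \right) \sum_{p_1 \in Q_1} \cdots \sum_{p_n \in Q_n} \mathbf{1}(f(\{p_1, \ldots, p_n\}) \le r) + \varepsilon' n = f(Q, r) + \varepsilon' n,$$

where $Q = \bigcup_i Q_i$.

Similarly we can achieve a symmetric lower bound for $f(\mathcal{P}, r)$. □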



6 Shape Inclusion Probabilities 

So far, we have been concerned only with computing probability distributions of single-valued functions on a
set of points. However, many geometric algorithms produce more than just a single value. In this section, 
we consider dealing with uncertainty when computing a two-dimensional shape (such as the convex hull, or 
MEB) of a set of points directly. 



6.1 Randomized Algorithms 

We can also use a variation of Algorithm 3.1 to construct $\varepsilon$-shape inclusion probability functions. For a
point set $Q \subset \mathbb{R}^d$, let the summarizing shape $S_Q = \mathcal{S}(Q)$ be from some geometric family $\mathcal{S}$ so $(\mathbb{R}^d, \mathcal{S})$ has
bounded VC-dimension $\nu$. We randomly sample m point sets $\mathbf{Q} = \{Q_1, \ldots, Q_m\}$ each from $\mu_P$ and then find
the summarizing shape $S_{Q_j} = \mathcal{S}(Q_j)$ (e.g. minimum enclosing ball) of each $Q_j$. Let this set of shapes be $S^{\mathbf{Q}}$.
If there are multiple shapes from $\mathcal{S}$ which are equally optimal, choose one of these shapes at random. For a
set of shapes $S' \subset \mathcal{S}$, let $S'_p \subset S'$ be the subset of shapes that contain $p \in \mathbb{R}^d$. We store $S^{\mathbf{Q}}$ and evaluate a
query point $p \in \mathbb{R}^d$ by counting what fraction of the shapes the point is contained in, specifically returning
$|S^{\mathbf{Q}}_p| / |S^{\mathbf{Q}}|$ in $O(\nu |S^{\mathbf{Q}}|) = O(\nu m)$ time. In some cases, this evaluation can be sped up with point location data
structures.
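A minimal sketch of this sampling scheme, using the axis-aligned bounding box as the summarizing shape, is given below. The function names are hypothetical, and each uncertain point is assumed to be supplied as a callable that draws one location from its distribution given a `random.Random` instance; neither detail comes from the text.

```python
import random


def bounding_box(points):
    """Summarizing shape: the smallest axis-aligned box (x0, y0, x1, y1) around the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def sample_shapes(samplers, m, rng):
    """Draw m point sets (one location per uncertain point) and summarize each one."""
    return [bounding_box([draw(rng) for draw in samplers]) for _ in range(m)]


def sip_query(shapes, p):
    """Fraction of the stored shapes that contain the query point p."""
    x, y = p
    inside = sum(1 for x0, y0, x1, y1 in shapes if x0 <= x <= x1 and y0 <= y <= y1)
    return inside / len(shapes)
```

For instance, with `samplers = [lambda rng, c=c: (rng.gauss(c[0], 1.0), rng.gauss(c[1], 1.0)) for c in centers]` for some list `centers` of mean locations, `sip_query(sample_shapes(samplers, 5000, random.Random(0)), (0.0, 0.0))` estimates the probability that the origin lies in the bounding box.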

To state the main theorem most cleanly, for a range space $(\mathbb{R}^d, \mathcal{S})$, denote its dual range space as $(\mathcal{S}, P^*)$
where $P^*$ is all subsets $\mathcal{S}_p \subseteq \mathcal{S}$, for any point $p \in \mathbb{R}^d$, such that $\mathcal{S}_p = \{S \in \mathcal{S} \mid p \in S\}$. Recall that the
VC-dimension $\nu$ of $(\mathcal{S}, P^*)$ is at most $2^{\nu'+1}$ where $\nu'$ is the VC-dimension of $(\mathbb{R}^d, \mathcal{S})$, but is typically much
smaller.

Theorem 6.1. Consider a family of summarizing shapes $\mathcal{S}$ with dual range space $(\mathcal{S}, P^*)$ with VC-dimension
$\nu$, and where it takes $T_{\mathcal{S}}(n)$ time to determine the summarizing shape $\mathcal{S}(Q)$ for any point set $Q \subset \mathbb{R}^d$ of
size n. For a distribution $\mu_P$ of a point set of size n, with probability at least $1 - \delta$, we can construct an
$\varepsilon$-sip function of size $O((\nu/\varepsilon^2)(\nu + \log(1/\delta)))$ and in time $O(T_{\mathcal{S}}(n)(1/\varepsilon^2) \log(1/\delta))$.



Fig. 10: The sip for the smallest enclosing ball (a,b) or smallest enclosing axis-aligned rectangle (c,d), for uniformly (a,c)
or normally (b,d) distributed points. Isolines are drawn for $p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$.



Proof. Using the above algorithm, sample $m = O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ point sets Q from $\mu_P$ and generate
the m summarizing shapes $S_Q$. Each shape is a random sample from $\mathcal{S}$ according to $\mu_P$, and thus $S^{\mathbf{Q}}$ is an
$\varepsilon$-sample of $(\mathcal{S}, P^*)$.

Let $w_{\mu_P}(S)$, for $S \in \mathcal{S}$, be the probability that S is the summarizing shape of a point set Q drawn
randomly from $\mu_P$. For any $\mathcal{S}' \subset P^*$, let $W_{\mu_P}(\mathcal{S}') = \int_{S \in \mathcal{S}'} w_{\mu_P}(S) \, dS$ be the probability that some shape
from the subset $\mathcal{S}'$ is the summarizing shape of Q drawn from $\mu_P$.

We approximate the sip function at $p \in \mathbb{R}^d$ by returning the fraction $|S^{\mathbf{Q}}_p| / m$. The true answer to the
sip function at $p \in \mathbb{R}^d$ is $W_{\mu_P}(\mathcal{S}_p)$. Since $S^{\mathbf{Q}}$ is an $\varepsilon$-sample of $(\mathcal{S}, P^*)$, then with probability at least $1 - \delta$

$$\left| \frac{|S^{\mathbf{Q}}_p|}{|S^{\mathbf{Q}}|} - \frac{W_{\mu_P}(\mathcal{S}_p)}{W_{\mu_P}(P^*)} \right| \le \varepsilon.$$

□



Representing $\varepsilon$-sip functions by isolines. Shape inclusion probability functions are density functions. A
convenient way of visually representing a density function in $\mathbb{R}^2$ is by drawing the isolines. A $\gamma$-isoline is a
collection of closed curves bounding the regions of the plane where the density function is greater than $\gamma$.



In each part of Figure 10 a set of 5 circles correspond to points with a probability distribution. In parts
(a,c) the probability distribution is uniform over the inside of the circles. In parts (b,d) it is drawn from a
normal distribution with standard deviation given by the radius. We generate $\varepsilon$-sip functions for the smallest
enclosing ball in Figure 10(a,b) and for the smallest axis-aligned rectangle in Figure 10(c,d).

In all figures we draw approximations of {.9, .7, .5, .3, .1}-isolines. These drawings are generated by randomly
selecting m = 5000 (Figure 10(a,b)) or m = 25000 (Figure 10(c,d)) shapes, counting the number of inclusions
at different points in the plane and interpolating to get the isolines. The innermost and darkest region has
probability > 90%, the next one probability > 70%, etc., the outermost region has probability < 10%.



6.2 Deterministic Algorithms 

We can also adapt the deterministic algorithms presented in Section 4.1 to deterministically create SIP
functions for indecisive (or $\varepsilon$-SIPs for many uncertain) points. We again restrict our attention to a class of
LP-type problems, specifically, problems where the output is a (often minimal) summarizing shape S(Q) of a 
data set Q where the boundaries of the two shapes S(Q) and S(Q') intersect in at most a constant number 
of locations. An example is the smallest enclosing disk in the plane, where the circles on the boundaries of 
any two disks intersect at most twice. 

As in Section 4.1, this problem has a constant combinatorial dimension $\beta$ and possible locations of
indecisive points can be labeled inside or outside S(Q) using a full violation test. This implies, following
Algorithm 4.1, that we can enumerate all $O((nk)^\beta)$ potential bases, and for each determine its weight towards
the SIP function using the full violation test. This procedure generates a set of $O((nk)^\beta)$ weighted shapes in
$O((nk)^{\beta+1})$ time. Finally, a query to the SIP function can be evaluated by counting the weighted fraction of
shapes that it is contained in.

We can also build a data structure to speed up the query time. Since the boundary of each pair of shapes
intersects a constant number of times, all $O((nk)^{2\beta})$ pairs intersect at most $O((nk)^{2\beta})$ times in total. In
$\mathbb{R}^2$, the arrangement of these shapes forms a planar subdivision with as many regions as intersection points.
We can precompute the weighted fraction of shapes overlapping on each region. Textbook techniques can
be used to build a query structure of size $O((nk)^{2\beta})$ that allows for stabbing queries in time $O(\log(nk))$ to
determine in which region the query lies, and hence what the associated weighted fraction of points is, and what
the SIP value is. We summarize these results in the following theorem.
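The idea of precomputing the weighted fraction on each region can be illustrated in one dimension, where weighted shapes are intervals and the "subdivision" is the sorted list of their endpoints. This is a deliberately simplified analogue, not the planar structure described above: the function names are hypothetical, and intervals are treated as half-open [lo, hi) to keep the bookkeeping short.

```python
import bisect


def build_stabbing_structure(intervals):
    """intervals is a list of (lo, hi, weight) describing weighted shapes [lo, hi).

    Returns sorted breakpoints and, for each, the total weight covering the region
    that starts at that breakpoint.
    """
    delta = {}
    for lo, hi, w in intervals:
        delta[lo] = delta.get(lo, 0.0) + w  # shape starts covering at lo
        delta[hi] = delta.get(hi, 0.0) - w  # shape stops covering at hi
    xs = sorted(delta)
    values, running = [], 0.0
    for x in xs:
        running += delta[x]
        values.append(running)
    return xs, values


def stab(structure, p):
    """Return the total weight of the shapes containing p using one binary search."""
    xs, values = structure
    i = bisect.bisect_right(xs, p) - 1
    return values[i] if i >= 0 else 0.0
```

The structure has one entry per endpoint and answers each query in logarithmic time, mirroring the size and query trade-off of the planar version.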

Theorem 6.2. Consider a set $\mathcal{P}$ of n indecisive point sets of size k each, and an LP-type problem $f_{\mathcal{S}} : \mathcal{P} \to \mathbb{R}$
with combinatorial dimension $\beta$ that finds a summarizing shape for which every pair intersects a constant
number of times. We can create a data structure of size $O((nk)^\beta)$ in time $O((nk)^{\beta+1})$ that answers SIP
queries exactly in $O((nk)^\beta)$ time, and another structure of size $O((nk)^{2\beta})$ in time $O((nk)^{2\beta})$ that answers
SIP queries exactly in $O(\log(nk))$ time.

Furthermore, through specific invocations of Theorem 5.1, we can extend these polynomial deterministic 
approaches to create ε-SIP data structures for many natural classes of uncertain input point sets. 

7 Conclusions 

In this paper, we studied the computation and representation of complete probability distributions on the 
output of single- and multi-valued geometric functions, when the input points are uncertain. We considered 
randomized and deterministic, exact and approximate approaches for indecisive and probabilistic uncertain 
points, and presented polynomial-time algorithms as well as hardness results. These results extend to when 
the output distribution is over the family of low-description-complexity summarizing shapes. 

We draw two main conclusions. Firstly, we observe that the tractability of exact computations on indecisive 
points really depends on the problem at hand. On the one hand, the output distribution of LP-type problems 
can be represented concisely and computed efficiently. On the other hand, even computing a single value of 
the output distribution of the diameter problem is already #P-hard. Secondly, we showed that computing 
approximate quantizations deterministically is often possible in polynomial time. However, the polynomials 
in question are of rather high degree, and while it is conceivable that these degrees can be reduced further, 
this will require some new ideas. In the meantime, the randomized alternatives remain more practical. 

We believe that the problem of representing and approximating distributions of more complicated objects 
(especially when they somehow represent uncertain points) is an important direction for further study. 

A ε-Samples of Distributions 



In this section we explore conditions for continuous distributions such that they can be approximated with 
bounded error by discrete distributions (point sets). We state specific results for multivariate normal 
distributions. 

We say a subset $W \subset \mathbb{R}^d$ is polygonal approximable if there exists a polygonal shape $S \subset \mathbb{R}^d$ with $m$ 
facets such that $\phi(W \setminus S) + \phi(S \setminus W) \le \varepsilon \phi(W)$ for any $\varepsilon > 0$. Usually, $m$ is dependent on $\varepsilon$, for instance 
for a $d$-variate normal distribution $m = O((1/\varepsilon^{d+1}) \log(1/\varepsilon))$ [?]. In turn, such a polygonal shape $S$ 
describes a continuous point set where $(S, \mathcal{A})$ can be given an ε-sample $Q$ using $O((1/\varepsilon^2) \log(1/\varepsilon))$ points if 
$(S, \mathcal{A})$ has bounded VC-dimension [44] or using $O((1/\varepsilon) \log^{2k}(1/\varepsilon))$ points if $\mathcal{A}$ is defined by a constant $k$ 
number of directions [?]. For instance, when $\mathcal{A} = \mathcal{B}$ is the set of all balls then the first case applies, and 
when $\mathcal{A} = \mathcal{R}_2$ is the set of all axis-aligned rectangles then either case applies. 

A shape $W \subset \mathbb{R}^{d+1}$ may describe a distribution $\mu : \mathbb{R}^d \to [0, 1]$. For instance, for a range space $(\mu, \mathcal{B})$, 
the range space of the associated shape is $(W_\mu, \mathcal{B} \times \mathbb{R})$, where $\mathcal{B} \times \mathbb{R}$ describes balls in $\mathbb{R}^d$ for the 
first $d$ coordinates and any points in the $(d+1)$th coordinate. 

The general scheme to create an ε-sample for $(S, \mathcal{A})$, where $S \subset \mathbb{R}^d$ is a polygonal shape, is to use a 
lattice $\Lambda$ of points. A lattice $\Lambda$ in $\mathbb{R}^d$ is an infinite set of points defined such that for $d$ vectors $\{v_1, \ldots, v_d\}$ 
that form a basis, for any point $p \in \Lambda$, $p + v_i$ and $p - v_i$ are also in $\Lambda$ for any $i \in [1, d]$. We first create 
a discrete $(\varepsilon/2)$-sample $M = \Lambda \cap S$ of $(S, \mathcal{A})$ and then create an $(\varepsilon/2)$-sample $Q$ of $(M, \mathcal{A})$ using standard 
techniques [13, 49]. Then $Q$ is an ε-sample of $(S, \mathcal{A})$. For a shape $S$ with $m$ $(d-1)$-faces on its boundary, any 
subset $A' \subset \mathbb{R}^d$ that is described by a subset from $(S, \mathcal{A})$ is an intersection $A' = A \cap S$ for some $A \in \mathcal{A}$. Since 
$S$ has $m$ $(d-1)$-dimensional faces, we can bound the VC-dimension of $(S, \mathcal{A})$ as $\nu = O((m + \nu_{\mathcal{A}}) \log(m + \nu_{\mathcal{A}}))$, 
where $\nu_{\mathcal{A}}$ is the VC-dimension of $(\mathbb{R}^d, \mathcal{A})$. Finally, the set $M = S \cap \Lambda$ is determined by choosing an arbitrary 
initial origin point in $\Lambda$ and then uniformly scaling all vectors $\{v_1, \ldots, v_d\}$ until $|M| = \Theta((\nu/\varepsilon^2) \log(\nu/\varepsilon))$ [?]. 
This construction follows a less general but smaller construction in Phillips [49]. 

It follows that we can create such an ε-sample $K$ of $(S, \mathcal{A})$ of size $|M|$ in time $O(|M| m \log |M|)$ by starting 
with a scaling of the lattice so a constant number of points are in $S$ and then doubling the scale until we get 
to within a factor of $d$ of $|M|$. If there are $n$ points inside $S$, it takes $O(nm)$ time to count them. We can 
then take another ε-sample of $(K, \mathcal{A})$ of size $O((\nu_{\mathcal{A}}/\varepsilon^2) \log(\nu_{\mathcal{A}}/\varepsilon))$ in time $O(\nu_{\mathcal{A}}^{3\nu_{\mathcal{A}}} |M| ((1/\varepsilon^2) \log(\nu_{\mathcal{A}}/\varepsilon))^{\nu_{\mathcal{A}}})$. 
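
The lattice-refinement loop can be sketched in the plane as follows. This is an illustrative simplification, not the paper's construction: it assumes a convex polygon given as half-planes `(a, b, c)` meaning `a*x + b*y <= c`, an axis-aligned square lattice, and the hypothetical helper names `lattice_points_in_polygon` and `refine_lattice`. Each point is tested against all `m` half-planes, which mirrors the $O(nm)$ counting cost, and the spacing is halved until the target count is reached.

```python
import math


def lattice_points_in_polygon(halfplanes, bbox, spacing, origin=(0.0, 0.0)):
    """Return square-lattice points (given spacing, shifted by origin) that lie
    inside the convex polygon described by half-planes a*x + b*y <= c."""
    xmin, ymin, xmax, ymax = bbox
    ox, oy = origin
    i_lo = math.ceil((xmin - ox) / spacing)
    i_hi = math.floor((xmax - ox) / spacing)
    j_lo = math.ceil((ymin - oy) / spacing)
    j_hi = math.floor((ymax - oy) / spacing)
    points = []
    for i in range(i_lo, i_hi + 1):
        for j in range(j_lo, j_hi + 1):
            x, y = ox + i * spacing, oy + j * spacing
            # One test per facet: O(m) work per lattice point.
            if all(a * x + b * y <= c for a, b, c in halfplanes):
                points.append((x, y))
    return points


def refine_lattice(halfplanes, bbox, target, spacing=1.0, origin=(0.0, 0.0),
                   max_rounds=40):
    """Halve the lattice spacing until at least `target` points fall inside."""
    for _ in range(max_rounds):
        points = lattice_points_in_polygon(halfplanes, bbox, spacing, origin)
        if len(points) >= target:
            return spacing, points
        spacing /= 2.0
    raise ValueError("target not reached; polygon may have empty interior")


# Unit square as four half-planes: -x <= 0, x <= 1, -y <= 0, y <= 1.
square = [(-1.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, -1.0, 0.0), (0.0, 1.0, 1.0)]
spacing, M = refine_lattice(square, (0.0, 0.0, 1.0, 1.0), target=10,
                            origin=(0.25, 0.25))
print(spacing, len(M))  # 0.25 25
```

The `origin` argument plays the role of the arbitrary lattice origin; the second step (thinning the lattice set down to a smaller ε-sample) is not shown.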

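Theorem A.1. For a polygonal shape $S \subset \mathbb{R}^d$ with $m$ (constant size) facets, we can construct an ε-sample 
for $(S, \mathcal{A})$ of size $O((\nu/\varepsilon^2) \log(\nu/\varepsilon))$ in time $O(m (\nu/\varepsilon^2) \log^2(\nu/\varepsilon))$, where $(S, \mathcal{A})$ has VC-dimension $\nu_{\mathcal{A}}$ and 
$\nu = O((\nu_{\mathcal{A}} + m) \log(\nu_{\mathcal{A}} + m))$. 

This can be reduced to size $O((\nu_{\mathcal{A}}/\varepsilon^2) \log(\nu_{\mathcal{A}}/\varepsilon))$ in time $O((1/\varepsilon^{2\nu_{\mathcal{A}}+2}) (m + \nu_{\mathcal{A}}) \log^{\nu_{\mathcal{A}}}((m + \nu_{\mathcal{A}})/\varepsilon) \log(m/\varepsilon))$. 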



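We can consider the specific case when $W \subset \mathbb{R}^{d+1}$ is a $d$-variate normal distribution $\mu : \mathbb{R}^d \to \mathbb{R}^+$. Then 
$m = O((1/\varepsilon^d) \log(1/\varepsilon))$ and $|M| = O((m/\varepsilon^2) \log m \log(m/\varepsilon)) = O((1/\varepsilon^{d+2}) \log^3(1/\varepsilon))$. 

Corollary A.2. Let $\mu : \mathbb{R}^d \to \mathbb{R}^+$ be a $d$-variate normal distribution with constant standard deviation. 
We can construct an ε-sample of $(\mu, \mathcal{A})$ with VC-dimension $\nu_{\mathcal{A}} \ge 2$ (and where $d \le \nu_{\mathcal{A}} \le 1/\varepsilon$) of size 
$O((\nu_{\mathcal{A}}/\varepsilon^2) \log(\nu_{\mathcal{A}}/\varepsilon))$ in time $O((1/\varepsilon^{2\nu_{\mathcal{A}}+d+2}) \log^{\nu_{\mathcal{A}}+1}(1/\varepsilon))$. 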

For convenience, we restate a tighter, but less general theorem from Phillips, here slightly generalized. 

Theorem A.3 ([49, 50]). Let $\mu : \mathbb{R}^d \to \mathbb{R}^+$ be a $d$-variate normal distribution with constant standard deviation. 
Let $(\mu, \mathcal{Q}_k)$ be a range space where the ranges are defined as the intersection of $k$ slabs with fixed normal directions. 
We can construct an ε-sample of $(\mu, \mathcal{Q}_k)$ of size $O((1/\varepsilon) \log^{2k}(1/\varepsilon))$ in time $O((1/\varepsilon^{d+4}) \log^{6k+3}(1/\varepsilon))$. 



A.1 Avoiding Degeneracy 

An important part of the above construction is the arbitrary choice of the origin points of the lattice $\Lambda$. This 
allows us to arbitrarily shift the lattice defining $M$ and thus the set $Q$. In Section 4.1 we need to construct $n$ 
ε-samples $\{Q_1, \ldots, Q_n\}$ for $n$ range spaces $\{(S_1, \mathcal{A}), \ldots, (S_n, \mathcal{A})\}$. In Algorithm 4.1 we examine sets of $\nu_{\mathcal{A}}$ 
points, each from separate ε-samples, that define a minimal shape $A \in \mathcal{A}$. It is important that we do not have 
two such (possibly not disjoint) sets of $\nu_{\mathcal{A}}$ points that define the same minimal shape $A \in \mathcal{A}$. (Note, this 
does not include cases where, say, two points are antipodal on a disk and any other point in the disk added 
to a set of $\nu_{\mathcal{A}} = 3$ points forms such a set; it refers to cases where, say, four points lie (degenerately) on the 
boundary of a disc.) We can guarantee this by enforcing a property on all pairs of origin points $p$ and $q$ for 
$(S_i, \mathcal{A})$ and $(S_j, \mathcal{A})$. For the purpose of construction, it is easiest to consider only the $l$th coordinates $p_l$ and 
$q_l$ for any pair of origin points or lattice vectors (where the same lattice vectors are used for each lattice). We 
enforce a specific property on every such pair $p_l$ and $q_l$, for all $l$ and all distributions and lattice vectors. 

First, consider the case where $\mathcal{A} = \mathcal{R}_d$ describes axis-aligned bounding boxes. It is easy to see that if 
for all pairs $p_l$ and $q_l$ the difference $(p_l - q_l)$ is irrational, then we cannot have more than $2d$ points on the boundary of an 
axis-aligned bounding box, hence the desired property is satisfied. 

Now consider the more complicated case where $\mathcal{A} = \mathcal{B}$ describes smallest enclosing balls. There is a 
polynomial of degree 2 that describes the boundary of the ball, so we can enforce that for all pairs $p_l$ and $q_l$ 
the difference $(p_l - q_l)$ is of the form $c_1 (r_{p_l})^{1/3} + c_2 (r_{q_l})^{1/3}$, where $c_1$ and $c_2$ are rational coefficients and $r_{p_l}$ and $r_{q_l}$ are 
distinct integers that are not multiples of cubes. Now if $\nu = d + 1$ such points satisfy (and in fact define) the 
equation of the boundary of a ball, then no $(d+2)$th point which has this property with respect to the first 
$d + 1$ can also satisfy this equation. 

More generally, if $\mathcal{A}$ can be described with a polynomial of degree $p$ with $\nu$ variables, then enforce that 
every pair of coordinates are the sum of $(p+1)$-roots. This ensures that no $\nu + 1$ points can satisfy the 
equation, and the undesired situation cannot occur. 

B Computing Other Measures on Uncertain Points 

We generalize this machinery to other LP-type problems $f$ defined on a set of points in $\mathbb{R}^d$ and with constant 
combinatorial dimension. Although in some cases (like smallest axis-aligned bounding box by perimeter) 
we are able to show that $(\mathbb{R}^d, \mathcal{A}_{f,n})$ has constant VC-dimension, for other cases (like radius of the smallest 
enclosing disk) we cannot, and need to first decompose each range $A(Z, w) \in \mathcal{A}_{f,n}$ into a set of disjoint 
"wedges" from a family of ranges $\mathcal{W}_f$. 

To simplify the already large polynomial runtimes below, we replace the runtime bound in Theorem C.2 
(below, which extends Theorem 4.2 to deal with large input sizes) with $O((nk)^{\beta+1} \log^4(nk))$. 

Lemma B.1. If the disjoint union of $m$ shapes from $\mathcal{W}_f$ can form any shape from $\mathcal{A}_{f,n}$, then an $(\varepsilon/m)$-sample 
of $(M, \mathcal{W}_f)$ is an ε-sample of $(M, \mathcal{A}_{f,n})$. 

Proof. For any shape $A \in \mathcal{A}_{f,n}$ we can create a set of $m$ shapes $\{W_1, \ldots, W_m\} \subset \mathcal{W}_f$ whose disjoint union is 
$A$. Since each range of $\mathcal{W}_f$ may have error $\varepsilon/m$, their union has error at most $\varepsilon$. □ 

We study several example cases for which we can deterministically compute ε-quantizations. For each 
case we show an example element of $\mathcal{A}_{f,n}$ on an example of 7 points. 

To facilitate the analysis, we define the notion of shatter dimension, which is similar to VC-dimension. 
The shatter function $\pi_T(m)$ of a range space $T = (Y, \mathcal{A})$ is the maximum number of sets in $T$ where $|Y| = m$. 
The shatter dimension $\sigma_T$ of a range space $T = (Y, \mathcal{A})$ is the minimum value such that $\pi_T(m) = O(m^{\sigma_T})$. If a 
range $A \in \mathcal{A}$ is defined by $k$ points, then $k \ge \sigma_T$. It can be shown [29] that $\sigma_T \le \nu_T$ and $\nu_T = O(\sigma_T \log \sigma_T)$. 
And, in general, the basis size of the related LP-type problem is bounded by $\beta \le \sigma_T$. 

Directional Width. We first consider the problem of finding the width along a particular direction $u$ (dwid). 
Given a point set $P$, $f(P)$ is the width of the minimum slab containing $P$, as in Figure 11(a). This can be 
thought of as a one-dimensional problem by projecting all points $P$ using the operation $\langle \cdot, u \rangle$. The directional 
width is then just the difference between the largest point and the smallest point. As such, the VC-dimension 
of $(\mathbb{R}^1, \mathcal{A}_f)$ is 2. Furthermore, $\mathcal{A}_{f,n} = \mathcal{A}_f$ in this case, so $(\mathbb{R}^d, \mathcal{A}_{f,n})$ also has VC-dimension 2. For $1 \le i \le n$, 
we can then create an $(\varepsilon/n)$-sample $Q_i$ of $(\mu_{p_i}, \mathcal{A}_{f,n})$ of size $k = O(n/\varepsilon)$ in $O((n/\varepsilon) \log(n/\varepsilon))$ time given 
basic knowledge of the distribution $\mu_{p_i}$. We can then apply Theorem 4.2 to build an ε-quantization in time 
$O((kn)^{2+1} \log^4(n^2/\varepsilon)) = O((n^6/\varepsilon^3) \log^4(n/\varepsilon))$ for the dwid case. 

Fig. 11: (a) Directional (vertical) width. (b) Axis-aligned bounding box, measured by area. The curves are hyperbola parts. 
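
To make the projection view concrete, the following sketch computes the directional width of a fixed point set and, by brute force over all choices, the cumulative distribution of that width for a tiny indecisive input with equally likely locations. The function names and the uniform-weight assumption are illustrative choices made here; the brute-force enumeration is exponential and only meant to show what an ε-quantization approximates.

```python
import math
from itertools import product


def directional_width(points, u):
    """Width of the narrowest slab orthogonal to direction u that contains points."""
    norm = math.sqrt(sum(c * c for c in u))
    # Project every point onto the unit vector u/|u|.
    proj = [sum(pc * uc for pc, uc in zip(p, u)) / norm for p in points]
    return max(proj) - min(proj)


def width_cdf_bruteforce(indecisive, u, r):
    """Probability that the directional width is at most r, assuming each
    indecisive point picks one of its locations uniformly and independently."""
    choices = list(product(*indecisive))
    hits = sum(1 for c in choices if directional_width(c, u) <= r)
    return hits / len(choices)


print(directional_width([(0, 0), (2, 1), (1, 3)], (0, 1)))  # 3.0

# Two indecisive points with two possible locations each.
indecisive = [[(0, 0), (0, 1)], [(0, 2), (0, 4)]]
print(width_cdf_bruteforce(indecisive, (0, 1), 2))  # 0.5
```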

We can actually evaluate $f(Q, r)$ for all values of $r$ faster using a series of sweep lines. Each of the $O(n^4/\varepsilon^2)$ 
potential bases is defined by a left and a right end point. Each of the $O(n^2/\varepsilon)$ points in $\bigcup_i Q_i$ could be a left 
or right end point. We only need to find the $O(1/\varepsilon)$ widths (defined by pairs of end points) that wind up in 
the final ε-quantization. Using a Frederickson and Johnson approach [23], we can search for each width in 
$O(\log(n/\varepsilon))$ steps. At each step we are given a width $\omega$ and need to decide what fraction of supports have 
width at most $\omega$. We can scan from each of the possible left end points and count the number of supports 
that have width at most $\omega$. For each $\omega$, this can be performed in $O(n^2/\varepsilon)$ time with a pair of simultaneous 
sweep lines. The total runtime is $O(1/\varepsilon) \cdot O(\log(n/\varepsilon)) \cdot O(n^2/\varepsilon) = O((n^2/\varepsilon^2) \log(n/\varepsilon))$. 
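
The pair of simultaneous sweep lines can be illustrated with a two-pointer scan over sorted projected values. This is a simplified sketch: it counts unweighted pairs of end points whose gap is at most $\omega$ and ignores how many supports share each pair of end points, which the full procedure would have to account for. The function name `count_pairs_within` is a hypothetical one chosen here.

```python
def count_pairs_within(sorted_vals, omega):
    """Count pairs (i < j) with sorted_vals[j] - sorted_vals[i] <= omega.

    Both indices only move forward, so the scan is linear in len(sorted_vals).
    """
    n = len(sorted_vals)
    count = 0
    right = 0
    for left in range(n):
        if right < left:
            right = left
        # Advance the right sweep line while the gap stays within omega.
        while right + 1 < n and sorted_vals[right + 1] - sorted_vals[left] <= omega:
            right += 1
        count += right - left
    return count


values = sorted([6, 0, 3, 1])
print(count_pairs_within(values, 3))  # 4: pairs (0,1), (0,3), (1,3), (3,6)
```

Wrapping this decision step in a search over candidate widths yields the logarithmic number of steps per quantile mentioned above.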

Theorem B.2. We can create an ε-quantization of size $O(1/\varepsilon)$ for the dwid problem in $O((n^2/\varepsilon^2) \log(n/\varepsilon))$ 
time. 



Axis-aligned bounding box. We now consider the set of problems related to axis-aligned bounding boxes in 
$\mathbb{R}^d$. For a point set $P$, we minimize $f(P)$, which either represents the $d$-dimensional volume of $S(P)$ (the 
aabbv case, which minimizes the area in $\mathbb{R}^2$) or the $(d-1)$-dimensional volume of the boundary of $S(P)$ (the 
aabbp case, which minimizes the perimeter in $\mathbb{R}^2$). Figures 9(a) and 11(b) show two examples of elements of $\mathcal{A}_{f,n}$ 
for the aabbp case and the aabbv case in $\mathbb{R}^2$. For both, $(\mathbb{R}^2, \mathcal{A}_{f,n})$ has a shatter dimension of 4 because the 
shape is determined by the x-coordinates of 2 points and the y-coordinates of 2 points. This generalizes to a 
shatter dimension of $2d$ for $(\mathbb{R}^d, \mathcal{A}_{f,n})$, and hence a VC-dimension of $O(d \log d)$. The smaller VC-dimension 
in the aabbp case discussed in detail above can be extended to higher dimensions. 

Hence, for $1 \le i \le n$, for both cases we can create an $(\varepsilon/n)$-sample $Q_i$ of $(\mu_{p_i}, \mathcal{A}_{f,n})$, each of size 
$k = O((n^2/\varepsilon^2) \log(n/\varepsilon))$, in total time $O(((n/\varepsilon) \log(n/\varepsilon))^{O(d \log d)})$ via Corollary A.2. We can then construct 
the ε-quantization in $O((kn)^{2d+1} \log^4(nk)) = O((n^{6d+3}/\varepsilon^{4d+2}) \log^{2d+1}(n/\varepsilon))$ time via Theorem 4.2. 

Theorem B.3. We can create an ε-quantization of size $O(1/\varepsilon)$ for the aabbp or aabbv problem on $n$ $d$-variate 
normal distributions in $O(((n/\varepsilon) \log(n/\varepsilon))^{O(d \log d)})$ time. 



Smallest enclosing ball. Figure 12 shows example elements of $\mathcal{A}_{f,n}$ for smallest enclosing ball, for metrics 
$L_\infty$ (the seb$_\infty$ case) and $L_1$ (the seb$_1$ case) in $\mathbb{R}^2$. An example element of $\mathcal{A}_{f,n}$ for smallest enclosing ball for 
the $L_2$ metric (the seb$_2$ case) was shown in Figure 9(b). For seb$_\infty$ and seb$_1$, $(\mathbb{R}^d, \mathcal{A}_{f,n})$ has VC-dimension $2d$ 
because the shapes are defined by the intersection of halfspaces from $d$ predefined normal directions. For seb$_1$ 
and seb$_\infty$, we can create $n$ $(\varepsilon/n)$-samples $Q_i$ of each $(\mu_{p_i}, \mathcal{A}_{f,n})$ of size $k = O((n/\varepsilon) \log^{2d}(n/\varepsilon))$ in total time 
$O(n (n/\varepsilon)^{d+4} \log^{6k+3}(n/\varepsilon))$ via Theorem A.3. We can then create an ε-quantization in $O((nk)^{2d+1} \log^4(nk)) = 
O((n^{4d+2}/\varepsilon^{2d+1}) \log^{4d^2+2d+4}(n/\varepsilon))$ time via Theorem 4.2. 

Fig. 12: (a) Smallest enclosing ball, $L_\infty$ metric. (b) Smallest enclosing ball, $L_1$ metric. 

Theorem B.4. We can create an ε-quantization of size $O(1/\varepsilon)$ for the seb$_1$ or seb$_\infty$ problem on $n$ $d$-variate 
normal distributions in $O((n^{4d+2}/\varepsilon^{2d+1}) \log^{4d^2+2d+4}(n/\varepsilon))$ time. 

For the seb$_2$ case in $\mathbb{R}^2$, $(\mathbb{R}^2, \mathcal{A}_{f,n})$ has infinite VC-dimension, but $(\mathbb{R}^2, \mathcal{W}_f)$ has VC-dimension at most 9 
because it is the intersection of 2 halfspaces and one disc. Any shape $A(T, w) \in \mathcal{A}_{f,n}$ can be formed from 
the disjoint union of $2n$ wedges. Choosing a point in the convex hull of $T$ as the vertex of the wedges will 
ensure that each wedge is completely inside the ball that defines part of its boundary. Thus, in $\mathbb{R}^2$ the $n$ 
$(\varepsilon/n)$-samples of each $(\mu_{p_i}, \mathcal{A}_{f,n})$ are of size $\lambda_f(n, \varepsilon) = O((n^4/\varepsilon^2) \log(n/\varepsilon))$ and can all be calculated in total 
time $O((n^5/\varepsilon^2) \log^2(n/\varepsilon))$. The ε-quantization can then be calculated in $O((n^{56/3}/\varepsilon^{22/3}) \log^{11/3}(n/\varepsilon))$ 
time by assuming general position of all $Q_i$ and then using range counting data structures. We conjecture 
this technique can be extended to $\mathbb{R}^d$, but we do not know how to decompose a shape $A \in \mathcal{A}_{f,n}$ into a 
polynomial number of wedges with constant VC-dimension. 



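Tab. 1: ε-Samples for Summarizing Shape Family $\mathcal{A}_{f,n}$ 

case                          $k$                             $\sigma_T$   $(nk)^{\sigma_T}$                         $RC(k, \mathcal{A}_f)$                       runtime 
dwid                          $\tilde{O}(n/\varepsilon)$        $2$          $\tilde{O}(n^4/\varepsilon^2)$            $O(1)$                                       $\tilde{O}(n^5/\varepsilon^2)$ 
aabbp                         $\tilde{O}(n^2/\varepsilon^2)$    $2d$         $\tilde{O}(n^{6d}/\varepsilon^{4d})$      $O(1)$                                       $\tilde{O}(n^{6d+1}/\varepsilon^{4d})$ 
aabbv                         $\tilde{O}(n^2/\varepsilon^2)$    $2d$         $\tilde{O}(n^{6d}/\varepsilon^{4d})$      $O(1)$                                       $\tilde{O}(n^{6d+1}/\varepsilon^{4d})$ 
seb$_\infty$                  $\tilde{O}(n/\varepsilon)$        $d+1$        $\tilde{O}(n^{2d+2}/\varepsilon^{d+1})$   $O(1)$                                       $\tilde{O}(n^{2d+3}/\varepsilon^{d+1})$ 
seb$_1$                       $\tilde{O}(n/\varepsilon)$        $d+1$        $\tilde{O}(n^{2d+2}/\varepsilon^{d+1})$   $O(1)$                                       $\tilde{O}(n^{2d+3}/\varepsilon^{d+1})$ 
seb$_2 \in \mathbb{R}^2$      $\tilde{O}(n^4/\varepsilon^2)$    $3$          $\tilde{O}(n^{15}/\varepsilon^6)$         $\tilde{O}(n^{8/3}/\varepsilon^{4/3})$       $\tilde{O}(n^{56/3}/\varepsilon^{22/3})$ 
diam $\in \mathbb{R}^2$       $\tilde{O}(n^4/\varepsilon^2)$    $n$          $\tilde{O}((n^5/\varepsilon^2)^n)$        $\tilde{O}(n^4/\varepsilon^2)$               $\tilde{O}((n^5/\varepsilon^2)^{n+1})$ 

$\tilde{O}(f(n, \varepsilon))$ ignores poly-logarithmic factors $(\log(n/\varepsilon))^{O(\mathrm{poly}(d))}$. 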



C Algorithm 4.1 in RAM model 

In this section we analyze Algorithm 4.1 without the assumption that the integer $k^n$ can be stored in 
$O(1)$ words. To simplify the results, we assume a RAM model where a word contains $O(\log(nk))$ bits, 
each weight $w(q_{i,j})$ can be stored in one word, and the weight of any basis $f(B)$, the product of $n$ $O(1)$-word 
weights, can thus be stored in $O(n)$ words. The following lemma describes the main result we will need 
relating to large numbers. 

Lemma C.1. In the RAM model where a word has $b$ bits we can calculate the product of $n$ numbers 
where each is described by $O(b)$ bits in $(nb \log^2 n) 2^{O(\log^* n)}$ time. 

Proof. Using Fürer's recent result [24] we can multiply two $m$-bit numbers in $m \log m \, 2^{O(\log^* m)}$ bit operations. 
The product of $n$ $m$-bit numbers has $O(\log((2^m)^n)) = O(nm)$ bits, and can be accomplished with $n - 1$ 
pairwise multiplications. 

We can calculate this product efficiently from the bottom up, starting with $n/2$ multiplications of two 
$m = O(b)$ bit numbers. Then we perform $n/4$ multiplications of two $2m$ bit numbers, and so on. Since each 
operation on $O(b)$ bits takes $O(1)$ time in our model, Fürer's result clearly upper bounds the RAM result. 
The total cost of this can be written as 

$$\sum_{i=0}^{\log n} \frac{n}{2^{i+1}} (2^i m) \log(2^i m) \, 2^{O(\log^*(2^i m))} = \sum_{i=0}^{\log n} \frac{nb}{2} (i + \log b) \, 2^{O(\log^*(2^i b))} = nb \log n \log(nb) \, 2^{O(\log^*(nb))}.$$ 

Since we assume $b = O(\log n)$ we can simplify this bound to $(nb \log^2 n) 2^{O(\log^* n)}$. □ 
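
The bottom-up schedule in the proof can be sketched as a product tree: multiply neighbours pairwise, then multiply the results pairwise, and so on, so that operands at each level have roughly equal bit length. The sketch below relies on Python's built-in arbitrary-precision integers, whose multiplication is not Fürer's algorithm, so it illustrates only the order of the multiplications and not the stated bit-complexity bound. The name `product_tree` is a hypothetical one introduced here.

```python
import math


def product_tree(values):
    """Multiply a sequence of integers level by level, pairing neighbours."""
    level = list(values)
    if not level:
        return 1
    while len(level) > 1:
        # About half as many multiplications per level, each on operands
        # roughly twice as long as on the level below.
        nxt = [level[i] * level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
    return level[0]


weights = [3, 5, 7, 11, 13]
print(product_tree(weights))  # 15015
```

Using `n - 1` multiplications in total, the tree keeps the two operands of every multiplication balanced, which is what makes the per-level cost in the sum above nearly uniform.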



This implies that Algorithm 4.1 takes $O((nk)^{\beta+1}) + ((nk)^{\beta} n \log(nk) \log^2 n) 2^{O(\log^* n)}$ time where each weight 
can be described in $O(b) = O(\log(nk))$ bits. This dominates the single division and all other operations. Now 
we can rewrite Theorem 4.2 without the restriction that $k^n$ can be stored in $O(1)$ words. 

Theorem C.2. Given a set $Q$ of $n$ indecisive point sets of size $k$ each, and given an LP-type problem 
$f : Q \to \mathbb{R}$ with combinatorial dimension $\beta$, we can create the distribution of $f$ over $Q$ in $O((nk)^{\beta+1}) + 
((nk)^{\beta} n \log(nk) \log^2 n) 2^{O(\log^* n)}$ time. The size of the distribution is $O(n (nk)^{\beta})$. 






